Kaggle is a site that hosts data science competitions. Individuals and teams compete for prizes to solve problems using data science and machine learning. There are currently 16 active competitions and over 250 completed competitions.
On April 14, 1912, the RMS Titanic struck an iceberg in the North Atlantic Ocean and sank. Of the 2,224 people on board, only 706 survived.
The goal of this exercise is to predict survivors on the Titanic based on nine input variables, described below. We are provided two datasets: (1) train.csv, containing 891 records and (2) test.csv, containing 418 records. The two datasets are provided with the intent that models are formulated using the train dataset and model performance is evaluated on the test dataset.
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket | class 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in year | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
sibsp: The dataset defines family relations in this way… Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way… Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | S | |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | S | |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | Q |
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | Q | |
| 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | S | |
| 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | Q | |
| 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | S | |
| 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | S | |
| 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.2250 | S |
The first step is to find any and all missing data in the train and test sets.
# go through each variable and if it's an empty factor or numeric NA, sum each column
nlTrainNA = sapply( train, function(x) switch( class(x), factor = sum(x==""), sum( is.na(x) ) ) );
tTrainNA = t(nlTrainNA); # transpose named list
dfTrainNA = data.frame( tTrainNA ); # convert to dataframe
The train.csv dataset had 3 columns with missing values: Age, Cabin, and Embarked. Age is likely to be an important predictor of survival and we have data for 80% of the training subjects so imputing the missing values is likely to be beneficial. The source of embarkation may not have obvious predictive power, but given that we have data for over 99% of the training subjects, imputing the missing values could provide value. The Cabin variable is missing from over 77% of the test subjects. At first impression, this variable seems like an unlikely candidate to impute values since so much source data is missing.
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 177 | 0 | 0 | 0 | 0 | 687 | 2 |
nlTestNA = sapply( test, function(x) switch( class(x), factor = sum(x==""), sum( is.na(x) ) ) );
tTestNA = t(nlTestNA); # transpose named list
dfTestNA = data.frame( tTestNA ); # convert to dataframe
The test.csv dataset also had 3 columns with missing values: Age, Fare, and Cabin. Like the train dataset, the Cabin variable is sparse, with over 78% subjects missing values. The Age variable is populated for over 79% of the train subjects, and likely has good predictive power so it will likely be beneficial to impute values for subjects missing Age. The Fare variable is missing for a single subject. While Fare may not be an obvious predictor for survival, the fact that the dataset is over 99% complete for this variable indicates that it is a good candidate for imputation.
| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 86 | 0 | 0 | 0 | 1 | 327 | 0 |
PassengerId is a primary key for each row of data in the train and test sets. This variable will not be included in any of the predictive models.
Survived is the class we’re trying to predict.
Passenger class is either 1st, 2nd, or 3rd.
ggplot(train, aes(x = factor(Pclass), fill = factor(Pclass))) +
geom_bar( show.legend=FALSE) +
xlab("Pclass") +
ylab("Total Count")
The Pclass variable shows that most passengers in the train set held a 3rd class ticket. 491 of the 891 passengers were 3rd class, more than 1st and 2nd class combined.
ggplot(train, aes(x = factor(Pclass), fill = factor(Survived))) +
geom_bar(width = 0.5, position="dodge") +
xlab("Pclass") +
ylab("Total Count") +
labs(fill = "Survived")
If the Survived variable is plotted as a function of passenger class, it appears that Pclass will be a predictor for survivability. A higher percentage of first class passengers survived than died, contrary to the overall trend, whereas a far higher percentage of 3rd class passengers died than survived.
It may seem that the passenger name is a lot like the PassengerId in that each name acts as a sort of primary key into the data and using name as a model feature would not generalize well. However, the name field could possibly provide value. The first twenty names in the train dataset:
head(as.character(train$Name),n=20);
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
## [7] "McCarthy, Mr. Timothy J"
## [8] "Palsson, Master. Gosta Leonard"
## [9] "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"
## [10] "Nasser, Mrs. Nicholas (Adele Achem)"
## [11] "Sandstrom, Miss. Marguerite Rut"
## [12] "Bonnell, Miss. Elizabeth"
## [13] "Saundercock, Mr. William Henry"
## [14] "Andersson, Mr. Anders Johan"
## [15] "Vestrom, Miss. Hulda Amanda Adolfina"
## [16] "Hewlett, Mrs. (Mary D Kingcome) "
## [17] "Rice, Master. Eugene"
## [18] "Williams, Mr. Charles Eugene"
## [19] "Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)"
## [20] "Masselmani, Mrs. Fatima"
Each name begins with a surname before a comma and a title. If the passenger is a married woman, her maiden name appears in parentheses. Some of the titles may help infer age (Master. and Miss.) and surnames could help determine extended family travelling together, even if they’ve purchased separate tickets and are not in the same cabin. Additionally, the presence of diacritical marks in a name could indicate that the passenger is a non-English speaker who might have had difficulty understanding instructions or the gravity of the situation.
Sex is an unordered factor, male or female.
ggplot(train, aes(x = factor(Sex), fill = factor(Sex))) +
geom_bar( show.legend=FALSE) +
xlab("Sex") +
ylab("Total Count")
The Sex variable shows that most passengers in the training set were male (nearly 2/3 male). 577 of the 891 passengers were male, or 65%.
ggplot(train, aes(x = factor(Sex), fill = factor(Survived))) +
geom_bar(width = 0.5, position="dodge") +
xlab("Sex") +
ylab("Total Count") +
labs(fill = "Survived")
If the Survived variable is plotted as a function of Sex, it appears that Sex will be a strong predictor for survivability. Of the 314 female passengers in the training set, 233, or 74% survived. On the other hand, of the 577 male passengers, only 109 survived, or 19%.
As noted, there are age values for 80% of the training subjects, missing for 177 passengers. The distribution of ages is slightly skewed right, with a median of 28 years and a mean of 29.7 years.
ggplot(subset(train, !is.na(Age)), aes(x = Age)) +
geom_histogram(binwidth=4) +
xlab("Age") +
ylab("Total Count");
ggplot(subset(train,!is.na(Age)), aes(y=Age,x="")) + geom_boxplot();
Age as a predictor of survivability:
ggplot(subset(train,!is.na(Age)), aes(x = Age, fill = factor(Survived))) +
geom_density( position="stack") +
xlab("Age") +
ylab("Total Count") +
labs(fill = "Survived")
If we plot the Survived value as a function of the Age density, we see that there is a higher likelihood of younger passengers surviving over older passengers. Up until the mid-to-late teens, a training set passenger is more likely to survive than die, so age is likely to be a useful predictor for survivability.
This variable is unique in that it is combination of number of siblings or “1” if the passenger had a spouse on board the ship. In some cases, it is not clear what the SibSp variable is encoding when a “1” is found - is the passenger travelling with a sibling or a spouse? A SibSp of 2 or more is indicative of siblings. It will likely be beneficial to disambiguate this variable into separate ‘Siblings’ and ‘Spouses’ variables.
ggplot(train, aes(x = SibSp )) +
geom_histogram(binwidth=0.5) +
xlab("Number of Siblings/Spouses") +
ylab("Total Count")
Most passengers in the training set (68%) had neither a spouse or sibling on board. 23% of the training set passengers had a single sibling or spouse on board. The remaining 9% of the passengers had two or more (presumably) siblings on board.
ggplot(train, aes(x = SibSp, fill = factor(Survived))) +
geom_histogram(binwidth=0.5, position="dodge") +
xlab("Number of Siblings/Spouses") +
ylab("Total Count") +
labs(fill = "Survived")
Survivability does appear to trend with the number of siblings/spouses on board. A passenger having no siblings or spouses is most likely to have died, whereas a passenger with one or two siblings/spouses has around a 50% likelihood of surviving. The remaining cases of three to eight siblings are likely too few from which to draw inferences individually, so it might make sense to pool the SibSp values as follows: 0, 1, 2, 3> to avoid overfitting to specific training cases. A close examination of the seven instances of the SibSp variable in which SibSp equals 8 reveals that all the subjects were from the same family and were in the same cabin. Predicting that all families of size 8 will perish is unlikley to generalize well.
Similar to SibSp, this variable convolves two separate pieces of data: the number of parents and the number of children this passenger has on board. In some cases, it is not clear what the Parch variable is encoding when a “1” is found - is the passenger travelling with a parent or a child? This can be inferred if the Age variable is present for the passenger, but if the Age is missing, it will be ambiguous and may need further analysis. Perhaps the passenger’s title (Mr., Miss., Master) could help. Like SibSp, it will likely be beneficial to disambiguate this variable into separate ‘Parents’ and ‘Children’ variables.
ggplot(train, aes(x = Parch )) +
geom_histogram(binwidth=0.5) +
xlab("Number of Parents/Children") +
ylab("Total Count")
Most passengers in the training set (76%) had neither a parent or child on board. 13% of the training set passengers had a single parent or child on board. About 9% of the passengers in the training set (80) had two or more parents or children on board. The remaining 2% of the passengers had either 3, 4, 5, or 6 (presumably) children on board.
ggplot(train, aes(x = Parch, fill = factor(Survived))) +
geom_histogram(binwidth=0.5, position="dodge") +
xlab("Number of Parents/Children") +
ylab("Total Count") +
labs(fill = "Survived")
Survivability does appear to trend with the number of parents/children on board. A passenger having no parents or children is most likely to have died, whereas a passenger with one or two parents/children has around a 50% likelihood of surviving. The remaining cases of three to six children are likely too few from which to draw inferences individually, so it might make sense to group the Parch values as follows: 0, 1, 2, 3>= to avoid overfitting to specific training cases.
The entries in the Ticket column do not seem to be of a uniform format. Some ticket entries are just numbers - ranging from 693-392096. Other ticket entries have character prefixes like “C.A.” or “SOTON/O2”, followed by a (presumably) ticket number.
head(as.character(train$Ticket),n=20);
## [1] "A/5 21171" "PC 17599" "STON/O2. 3101282"
## [4] "113803" "373450" "330877"
## [7] "17463" "349909" "347742"
## [10] "237736" "PP 9549" "113783"
## [13] "A/5. 2151" "347082" "350406"
## [16] "248706" "382652" "244373"
## [19] "345763" "2649"
An inspection of the set of tickets shows that presumed families tend to share a single ticket number. The first impression is that the ticket number seems an unlikely predictor for survivability and could lead to overfitting the training set. However, the ticket number might help populate the missing Cabin information - and Cabin might be a good predictor for survivability.
The fare (price paid per ticket) ranges from 0 to 512.3292. The units are unclear, but are likely in English pounds. The distribution is skewed to the right, with a median of 14.4542 and a mean of 32.20421. A log transform of the data may be necessary to normalize the distribution of fares. However, first the fare for passenger must be determined. It appears to be the case that individual ticket numbers are not assigned per passenger, but rather a single ticket number is given to the purchaser of an allotment of tickets. That is, families travelling together seem to be under the same ticket with the same fare. So, it may be necessary to get to create an “Amount Paid per Passenger” feature that takes into account the number of people for which a fare was purchased on a single ticket.
ggplot(train, aes(x = Fare)) +
geom_histogram(binwidth=4) +
xlab("Fare") +
ylab("Total Count");
ggplot(train, aes(y=Fare,x="")) + geom_boxplot();
Fare could conceivably be an important factor in determining survivability. Perhaps the higher paying passengers received the first opportunity to board lifeboats. Or perhaps, those higher paying passengers were more initially unwilling to leave their more comfortable accomodations for the plebian conditions aboard a lifeboat. Fare as a predictor of survivability:
ggplot(train, aes(x = Fare, fill = factor(Survived))) +
geom_density( position="stack") +
xlab("Fare") +
ylab("Total Count") +
labs(fill = "Survived")
The Cabin feature could be another strong predictor for survivability. Perhaps cabins located nearest the lifeboats afforded the best survivability. But, the Cabin variable has many empty values. The empty values could mean that the information was not captured or it could mean that not all passengers received cabins and stayed in other accomodations. Being assigned a cabin could be a proxy for one’s social status and wealth. If so, the Pclass variable might be co-linear.
levels(train$Cabin);
## [1] "" "A10" "A14"
## [4] "A16" "A19" "A20"
## [7] "A23" "A24" "A26"
## [10] "A31" "A32" "A34"
## [13] "A36" "A5" "A6"
## [16] "A7" "B101" "B102"
## [19] "B18" "B19" "B20"
## [22] "B22" "B28" "B3"
## [25] "B30" "B35" "B37"
## [28] "B38" "B39" "B4"
## [31] "B41" "B42" "B49"
## [34] "B5" "B50" "B51 B53 B55"
## [37] "B57 B59 B63 B66" "B58 B60" "B69"
## [40] "B71" "B73" "B77"
## [43] "B78" "B79" "B80"
## [46] "B82 B84" "B86" "B94"
## [49] "B96 B98" "C101" "C103"
## [52] "C104" "C106" "C110"
## [55] "C111" "C118" "C123"
## [58] "C124" "C125" "C126"
## [61] "C128" "C148" "C2"
## [64] "C22 C26" "C23 C25 C27" "C30"
## [67] "C32" "C45" "C46"
## [70] "C47" "C49" "C50"
## [73] "C52" "C54" "C62 C64"
## [76] "C65" "C68" "C7"
## [79] "C70" "C78" "C82"
## [82] "C83" "C85" "C86"
## [85] "C87" "C90" "C91"
## [88] "C92" "C93" "C95"
## [91] "C99" "D" "D10 D12"
## [94] "D11" "D15" "D17"
## [97] "D19" "D20" "D21"
## [100] "D26" "D28" "D30"
## [103] "D33" "D35" "D36"
## [106] "D37" "D45" "D46"
## [109] "D47" "D48" "D49"
## [112] "D50" "D56" "D6"
## [115] "D7" "D9" "E10"
## [118] "E101" "E12" "E121"
## [121] "E17" "E24" "E25"
## [124] "E31" "E33" "E34"
## [127] "E36" "E38" "E40"
## [130] "E44" "E46" "E49"
## [133] "E50" "E58" "E63"
## [136] "E67" "E68" "E77"
## [139] "E8" "F E69" "F G63"
## [142] "F G73" "F2" "F33"
## [145] "F38" "F4" "G6"
## [148] "T"
The cabin name mostly adheres to the rule of a single letter A-F,G,T, followed by a number up to 3 digits. There are cases where a passenger has multiple cabins, each separated by whitespace. The beginning letter of each cabin could denote a deck or particular region of the ship - which could help with predicting survivability. Alternatively, the number of the cabin could be more informative than the beginning letter. Perhaps cabins “A19” and “B19” are located right next to one another, for instance.
cabinLetter = ifelse(train$Cabin == "", NA, substr(train$Cabin,1,1));
cabinREs = gregexpr("\\d+",train$Cabin, perl=TRUE);
cnMatches = regmatches( train$Cabin, cabinREs);
cabinNumber = numeric(length(cnMatches));
for ( i in 1:length(cnMatches) ){
cabinNumber[i] = mean( as.numeric( unlist( cnMatches[i] )));
}
edaTrain$CabinLetter <- as.factor(cabinLetter);
edaTrain$CabinNumber <- cabinNumber;
ggplot( edaTrain, aes(x=CabinNumber,y=Survived,color=Survived ) ) +
geom_point( shape=1,position=position_jitter(height=0.25)) +
ggtitle("Survivability by Cabin Number") +
xlab("Cabin Number") +
ylab("Survived");
ggplot(subset(edaTrain, !is.na(cabinLetter)), aes(x = CabinLetter, fill = as.factor(Survived))) +
geom_bar() +
ggtitle("Survivability by Ticket Letter") +
xlab("Cabin Letter") +
ylab("Total Count") +
labs(fill = "Survived");
When grouping the passengers by cabin number, there does not appear to be a relationship where survivability depends on cabin number. If so, there should be identifiable pockets of clusters where there is a higher incidence of survivability. If such clusters appeared to exist, the clusters could be defined and the group of clusters tested with a Chi-Square test to measure if survivability depends on cabin cluster.
It’s not immediately obvious if there is a benefit to categorizing the cabins according to their first letter. Are these groups statistically different from one another? A Chi-Square test of independence should show if survivability is dependent on cabin letter or not.
tbl = table( edaTrain$Survived, edaTrain$CabinLetter);
tbl
##
## A B C D E F G T
## 0 8 12 24 8 8 5 2 1
## 1 7 35 35 25 24 8 2 0
chisq.test(tbl);
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 10.301, df = 7, p-value = 0.1722
The p-value is 0.17 so at a confidence level of 0.05, we cannot reject the null hypothesis that survivability is independent of the starting cabin letter.
edaTrain$CabinAssignment[ edaTrain$Cabin != "" ] <- "Assigned";
edaTrain$CabinAssignment[ edaTrain$Cabin == "" ] <- "Unassigned";
data_combined$CabinAssignment[ data_combined$Cabin != "" ] <- "Assigned";
data_combined$CabinAssignment[ data_combined$Cabin == "" ] <- "Unassigned";
data_combined$CabinAssignment = factor( data_combined$CabinAssignment );
ggplot(edaTrain, aes(x=CabinAssignment, fill=factor(Survived))) +
geom_bar() +
facet_wrap(~Pclass) +
ggtitle("Pclass") +
xlab("Cabin Assignment") +
ylab("Total Count") +
labs(fill="Survived");
From the graph, it appears that for each passenger class, if a passenger is assigned a cabin, their chances of surviving the disaster are better than if they had not been assigned a cabin.
Passengers boarded the Titanic from one of three ports: (S)outhampton, England; (C)herbourg, France, or (Q)ueenstown, Ireland. As noted, two passengers in the training set have no record of port of Embarkation.
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 62 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38 | 0 | 0 | 113572 | 80 | B28 | |
| 830 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62 | 0 | 0 | 113572 | 80 | B28 |
The two passengers are both female and in first class.
ggplot( edaTrain, aes(x=factor(Pclass),fill=factor(Survived)))+
geom_bar() +
facet_wrap(~Embarked) +
ggtitle("Survivability by Port of Embarkation and Passenger Class") +
xlab("Passenger Class") +
ylab("Count") +
labs(fill="Survived");
Errors in the input dataset are often not apparent until deeper analyses are performed, such as feature engineering and imputing missing data. However, once the data is corrected, it is necessary to regenerate the columns and feature analyses that will feed the predictive models. In this dataset, it was found that a 16-year-old member of a family was incorrectly identified as being the father:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | CabinAssignment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 87 | 87 | 0 | 3 | Ford, Mr. William Neal | male | 16 | 1 | 3 | W./C. 6608 | 34.375 | S | Unassigned | |
| 148 | 148 | 0 | 3 | Ford, Miss. Robina Maggie “Ruby” | female | 9 | 2 | 2 | W./C. 6608 | 34.375 | S | Unassigned | |
| 437 | 437 | 0 | 3 | Ford, Miss. Doolina Margaret “Daisy” | female | 21 | 2 | 2 | W./C. 6608 | 34.375 | S | Unassigned | |
| 737 | 737 | 0 | 3 | Ford, Mrs. Edward (Margaret Ann Watson) | female | 48 | 1 | 3 | W./C. 6608 | 34.375 | S | Unassigned | |
| 1059 | 1059 | NA | 3 | Ford, Mr. Edward Watson | male | 18 | 2 | 2 | W./C. 6608 | 34.375 | S | Unassigned |
In the Ford family, Mr. William Neal Ford has the value ‘1’ for SibSp and ‘3’ for Parch. The value of ‘1’ for SibSp would imply that William has either one spouse or one sibling. In addition, William has a value of ‘3’ for Parch. This indicates that William has either 3 children, 1 parent and 2 children, 2 parents and 1 child, or 3 parents, which is clearly not possible according to the variable definition. Since two of the children in the family, Miss. Doolina Margaret Ford (21), and Mr. Edward Watson Ford (18) are actually older than William, it is clear that he cannot be their father. The other child, Miss. Robina Maggie Ford (9) is only 7 years younger than William, also indicating that William cannot be her father. The matriarch of the family is Mrs. Edward Ford - not Mrs. William Neal Ford, which indicates that it’s likely not a simple case of getting William Neal Ford’s age wrong (for instance, 46 instead of 16).
Therefore, the best fix would be to change the Parch variables for Miss. Robina Maggie Ford, Miss. Doolina Margaret Ford, and Mr. Edward Watson Ford from ‘2’ to ‘1’ (due to the inclusion of Mr. William Neal as a brother). In addition, the Parch variable of Mrs. Edward Ford would rise from ‘3’ to ‘4’, indicating she’s travelling with four of her children and the Parch variable of Mr. William Neal Ford would change from ‘3’ to ‘1’ indicating that he’s travelling with a single parent, his mother. The SibSp variables would change from ‘2’ to ‘3’ for Miss. Robina Maggie Ford, Miss. Doolina Margaret Ford, and Mr. Edward Watson Ford, indicating that each of them are travelling with 3 siblings. Similarly, Mr. William Neal Ford’s SibSp variable would change from ‘1’ to ‘3’. Finally, Mrs. Edward Ford’s SibSp would change from ‘1’ to ‘0’, indicating she was travelling without her husband (Mr. Edward Ford). The changes appear below.
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | CabinAssignment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 87 | 87 | 0 | 3 | Ford, Mr. William Neal | male | 16 | 3 | 1 | W./C. 6608 | 34.375 | S | Unassigned | |
| 148 | 148 | 0 | 3 | Ford, Miss. Robina Maggie “Ruby” | female | 9 | 3 | 1 | W./C. 6608 | 34.375 | S | Unassigned | |
| 437 | 437 | 0 | 3 | Ford, Miss. Doolina Margaret “Daisy” | female | 21 | 3 | 1 | W./C. 6608 | 34.375 | S | Unassigned | |
| 737 | 737 | 0 | 3 | Ford, Mrs. Edward (Margaret Ann Watson) | female | 48 | 1 | 4 | W./C. 6608 | 34.375 | S | Unassigned | |
| 1059 | 1059 | NA | 3 | Ford, Mr. Edward Watson | male | 18 | 3 | 1 | W./C. 6608 | 34.375 | S | Unassigned |
Additionally, there are problems with the Abbott family with ticket C.A., 2673:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | CabinAssignment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 280 | 280 | 1 | 3 | Abbott, Mrs. Stanton (Rosa Hunt) | female | 35 | 1 | 1 | C.A. 2673 | 20.25 | S | Unassigned | |
| 747 | 747 | 0 | 3 | Abbott, Mr. Rossmore Edward | male | 16 | 1 | 1 | C.A. 2673 | 20.25 | S | Unassigned | |
| 1284 | 1284 | NA | 3 | Abbott, Master. Eugene Joseph | male | 13 | 0 | 2 | C.A. 2673 | 20.25 | S | Unassigned |
Thirteen year old Master. Eugene Joseph Abbott has a Parch value of ‘2’, meaning he has two parents on board. A female on the same ticket, Mrs. Stanton Abbott, is 22 years older than Eugene at 35 years of age. Mrs. Stanton Abbott has a SibSp of ‘1’ and a Parch of ‘1’. The other person on the ticket is 16-year-old Mr. Rossmore Edward Abbott, with a SibSp of ‘1’ and a Parch of ‘1’. It appears that 16-year-old Mr. Rossmore Edward Abbott was incorrectly designated the spouse of Mrs. Stanton Abbott instead of her son. The corrections appear below.
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | CabinAssignment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 280 | 280 | 1 | 3 | Abbott, Mrs. Stanton (Rosa Hunt) | female | 35 | 0 | 2 | C.A. 2673 | 20.25 | S | Unassigned | |
| 747 | 747 | 0 | 3 | Abbott, Mr. Rossmore Edward | male | 16 | 1 | 1 | C.A. 2673 | 20.25 | S | Unassigned | |
| 1284 | 1284 | NA | 3 | Abbott, Master. Eugene Joseph | male | 13 | 1 | 1 | C.A. 2673 | 20.25 | S | Unassigned |
When importing into R, factor/class variables solely composed of numeric entries are interpreted and imported as numeric types, implying an ordering and a distance between entries. These variables include PassengerId, Survived, and Pclass. PassengerId is like a primary key for each passenger, so it’s not necessary to convert this variable into a factor. However, the Survived data is encoded as a binary flag where 1=survived and 0=not survived. Since our goal is to predict classes (i.e., survived or died), it is necessary to convert this value into a factor. The Pclass variable is a factor, but it is an ordered factor. An ordered factor indicates that there is an ordering between the classes (1st class, 2nd class, 3rd class), but no magnitude between each class level can be inferred.
train$Survived = as.factor(train$Survived);
train$Pclass = as.ordered(train$Pclass);
Two passengers had no information for their port of embarcation, Miss. Amelie Icard and Mrs. George Nelson Stone. These first class passengers were both on the same ticket (113572) and stayed in Cabin B28. The majority of passengers boarded at Southampton (70%), as compared to Cherbourg (21%) and Queenstown (9%). There is quite a variety in the ticket numbers on the Titanic. Their ticket number, 113572, is one in a series of similar ticket numbers, all in the 113XXX format. There are 56 tickets in the 113XXX format and of the 54 that have a valid port of embarcation, 81% were from Southampton. Considering the different data points, the two missing Embarked values will be considered ‘S’ for Southampton.
data_combined[62,"Embarked"] = as.factor("S");
data_combined[830,"Embarked"] = as.factor("S");
data_combined$Embarked = factor( data_combined$Embarked ); # removes empty factor ""
As determined in the EDA section, over 77% of the passengers have no recorded cabin information. In addition, the assigned cabins did not seem to have a predictable pattern from the available data. Further, cursory analyses did not indicate that cabin assignments were a good predictor of survivability. As such, the missing cabin data will not be imputed.
Out of 1309 records, one fare is missing:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | CabinAssignment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1044 | 1044 | NA | 3 | Storey, Mr. Thomas | male | 60.5 | 0 | 0 | 3701 | NA | S | Unassigned |
The fare can be estimated by considering the variables that most likely affect the Fare variable, such as passenger class (Pclass), port of Embarcation (Embarked), Cabin, and how many other people are on the same ticket. The passenger class and port of embarkation are readily available in each data row. The Cabin variable, however, is not useful in this case since the Cabin variable is not defined for passenger 1044. However, the fact that no Cabin was recorded for this passenger is significant since perhaps not all passengers were assigned cabins (i.e., wealthy passengers would be assigned cabins, and hence pay higher fares, while less privileged passengers may have to had large, unassigned community accomodations with a comparatively lower fare).
ticketChar = as.character( data_combined$Ticket );
numPassengers = nrow( data_combined );
data_combined$NumPassengersOnTicket = 1;
for (i in 1:numPassengers ){
thisTicket = as.character( data_combined[i,"Ticket"] );
idxPeople = which( thisTicket == ticketChar );
numPeople = length( idxPeople );
data_combined$NumPassengersOnTicket[i] = numPeople;
}
# now constrain to those that have just 1 person on the ticket, as passenger 1044
singlePassengers = data_combined$NumPassengersOnTicket == 1;
embarkedS = data_combined$Embarked == 'S';
unassignedCabins = data_combined$CabinAssignment == 'Unassigned';
pClass3 = data_combined$Pclass == 3;
similarPassengers = which( singlePassengers & embarkedS & unassignedCabins & pClass3 );
similarPassengersFareData = data_combined[similarPassengers,]; # include all columns but the one we have no Fare info
similarPassengersFareData = similarPassengersFareData[ similarPassengersFareData$PassengerId != 1044,]; # include all columns but the one we have no Fare info
medianFare = median( similarPassengersFareData$Fare );
data_combined[1044,"Fare"] = medianFare;
hist(similarPassengersFareData$Fare, xlab = "Fare", ylab = "Passengers", main = "Fare of Similar Passengers");
summary( similarPassengersFareData$Fare );
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.171 7.775 7.896 8.057 8.050 19.967
There are 318 passengers that share the same characteristics as passenger 1044: passenger class 3, port of embarkation Southampton, cabin assignment (unassigned), and the number of people with the same ticket (1). Excluding a fare of 3.1708 and two at 19.9667, the remaining fares are between 6.2375 and 10.5167. Passenger 1044 is assigned the median fare of 7.896.
As found above, the Age data is missing for 177 of the train set cases and 86 of the test set cases. Since Age is likely to be an important predictor of survival, the missing data should be imputed. As will be shown in the next section on Feature Engineering, the passenger’s title can be extracted from their name. The title should allow for a better estimate of a passenger’s age.
The missing passengers have the following titles:
## Capt. Col. Countess. Don. Dona. Dr. Jonkheer.
## 0 0 0 0 0 1 0
## Lady. Major. Master. Miss. Mlle. Mme. Mr.
## 0 0 8 50 0 0 176
## Mrs. Ms. Rev. Sir.
## 28 0 0 0
The resultant linear model uses the following average ages for each missing passenger title:
# model is simple: use the passenger's title (i.e., Mr., Master., Mrs.) to determine age
# May want to better this by using age of parents (if travelling with parents), age of siblings, spouse, etc.
age_lm = lm( Age~AgeTitle, data=data_combined);
missingAges = predict( age_lm, data_combined[missingAgeIndices,]);
data_combined[missingAgeIndices,"FixedAge"] = missingAges;
coeffs = coefficients( age_lm );
coeffNames = names(coeffs);
coeffValues = unname(coeffs);
# find (Intercept)
idxIntercept = match( "(Intercept)", coeffNames );
interceptValue = coeffValues[idxIntercept];
allOtherCoeffIndices = setdiff(1:length(coeffNames),idxIntercept);
allOtherCoeffValues = coeffValues[allOtherCoeffIndices];
titleAges = interceptValue + allOtherCoeffValues;
names(titleAges) = gsub( "AgeTitle", "", coeffNames[allOtherCoeffIndices] );
knitr::kable( bind_rows(titleAges), format="markdown", longtable=TRUE)
| Col. | Countess. | Don. | Dona. | Dr. | Jonkheer. | Lady. | Major. | Master. | Miss. | Mlle. | Mme. | Mr. | Mrs. | Ms. | Rev. | Sir. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 54 | 33 | 40 | 39 | 43.57143 | 38 | 48 | 48.5 | 5.482641 | 21.77424 | 24 | 24 | 32.25215 | 36.99412 | 28 | 41.25 | 49 |
The Name variable in the test and train datasets has some structure - surname followed by a comma, then a title and a given name. Additionally, a maiden name will be present in parentheses if the passenger is a married woman.
passenger_names = as.character(data_combined$Name);
head(passenger_names);
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
The implied structure is [SURNAME] COMMA [TITLE] [GIVEN NAME] (MAIDEN NAME if applicable). Since the goal is to extract the title information from each row programmatically, it is necessary to enforce some error checking - namely, that each row contain one and only one comma. If this condition does not hold true, the title extraction task will be more tedious.
num_commas = unname(sapply( passenger_names, str_count, ","));
all_commas = assert_that( all( num_commas == 1 ) ); # Make sure each row has a comma
# Now, extract titles. Split on comma and take the tail end of the character string
surnames = sapply( strsplit(as.character(passenger_names), ","), head, 1);
data_combined$Surname = as.factor( surnames );
given_name_string = sapply( strsplit(as.character(passenger_names), ","), tail, 1);
# Remove leading whitespace, if any
given_name_string = trimws( given_name_string, "left");
# To extract the title from the Name character string, split each string
name_tokens = strsplit( given_name_string, "\\s+");
# Exract first token as the title
titles = lapply(name_tokens,"[[",1);
given_names = lapply(name_tokens,"[[",2);
data_combined$GivenName = as.factor( unlist( given_names ) );
The full set of extracted titles:
passenger_titles = unlist( titles );
table( passenger_titles );
## passenger_titles
## Capt. Col. Don. Dona. Dr. Jonkheer. Lady.
## 1 4 1 1 8 1 1
## Major. Master. Miss. Mlle. Mme. Mr. Mrs.
## 2 61 260 2 1 757 197
## Ms. Rev. Sir. the
## 2 8 1 1
Most of these look correct and reasonable as titles (“Don.” and “Dona.” are Italian honorifics, as is “Jonkheer.” a Dutch honorific). However, the title “the” is suspicious. The complete row is as follows:
data_combined[ which( data_combined$Title == "Countess." ), ]
## PassengerId Survived Pclass
## 760 760 1 1
## Name Sex Age
## 760 Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) female 33
## SibSp Parch Ticket Fare Cabin Embarked CabinAssignment
## 760 0 0 110152 86.5 B77 S Assigned
## NumPassengersOnTicket FixedAge Surname
## 760 3 33 Rothes
## NameString GivenName
## 760 the Countess. of (Lucy Noel Martha Dyer-Edwards) Rothes Countess.
## Title FixedTitle AgeTitle
## 760 Countess. Mrs. Countess.
Although it makes little difference in this case since there is only a single “the Countess. of Rothes” on board, the title of this passenger should be changed from “the” to “Countess.”. This could be fixed by changing the single record, but the fact that this incorrect title occurred in the first place indicates that the methodology for finding titles isn’t robust enough. Since it appears that the only time a “.” appears in the Name field is at the end of a title, that information could be used to find all titles.
title_regs = regexpr( "[\\w+]+\\.", given_name_string, perl=TRUE );
re_titles = regmatches( given_name_string, title_regs );
data_combined$Title = as.factor( unlist( re_titles ));
summary( data_combined$Title );
## Capt. Col. Countess. Don. Dona. Dr. Jonkheer.
## 1 4 1 1 1 8 1
## Lady. Major. Master. Miss. Mlle. Mme. Mr.
## 1 2 61 260 2 1 757
## Mrs. Ms. Rev. Sir.
## 197 2 8 1
Of particular interest are the titles that are underrepresented and how they might relate to the most abundant titles (Mr., Mrs., Miss., and Master.). For instance, the title “Ms.” has only two occurences and as such, has limited predictive power. “Ms.” generally refers to an adult woman, either married or unmarried and so is much closer to either a “Mrs.” or “Miss.” title than a “Mr.”, for instance. Additionally, the French designations for “Miss.” and “Mrs.” appear as Madamoiselle (“Mlle.”) and Madame (“Mme.”). These titles would likely provide better predictive power if they were switched to their English equivalents.
Additionally, many of the Age attributes are missing from the dataset and since it seems likely that Age would be a strong predictor of survivability. Title should be a fairly robust predictor for Age, so engineering a title feature that is reflective of Age could be beneficial (i.e., a Doctor with a missing Age value is much more likely to be an adult than a child and adults have a generally lower survival rate than children). The titles of the passengers missing Age information are as follows:
# The set of titles that have missing age values
ageNA = is.na( data_combined$Age );
titleNA = data_combined$Title[ageNA];
summary( titleNA );
## Capt. Col. Countess. Don. Dona. Dr. Jonkheer.
## 0 0 0 0 0 1 0
## Lady. Major. Master. Miss. Mlle. Mme. Mr.
## 0 0 8 50 0 0 176
## Mrs. Ms. Rev. Sir.
## 27 1 0 0
For the purposes of estimating the Age variable for this group, the “Ms.” entry is better suited as a “Mrs.”. Although the “Dr.” title has a relatively small number of instances, it is likely enough from which to draw a reasonable age estimate.
# The set of titles that have missing age values
data_combined$AgeTitle = data_combined$Title;
data_combined$AgeTitle[ which( data_combined$Title == "Ms." & ageNA )] = as.factor( "Mrs." );
summary( data_combined$AgeTitle );
## Capt. Col. Countess. Don. Dona. Dr. Jonkheer.
## 1 4 1 1 1 8 1
## Lady. Major. Master. Miss. Mlle. Mme. Mr.
## 1 2 61 260 2 1 757
## Mrs. Ms. Rev. Sir.
## 198 1 8 1
Next, the Parch variable encodes two different measures: the number of parents a passenger has on board AND/OR the number of children a passenger has on board. Since this variable is overloaded, a more informative variable set might disambiguate these measures into NumParents and NumChildren (aboard). Such a set of variables may provide more predictive power than the single Parch variable.
numParents = integer( nrow( data_combined ) );
numChildren = integer( nrow( data_combined ) );
MAX_CHILD_AGE = 14;
# Find all Parch > 0
posParch = data_combined$Parch > 0;
idxParch = which( posParch );
counter = 0;
numCases = length( idxParch );
for (thisRow in idxParch){
sibsp = data_combined[thisRow,"Sibsp"];
# if sibsp > 1, then the passenger is travelling with siblings (and therefore, likely parents)
counter = counter + 1;
numParch = data_combined[thisRow,"Parch"];
thisSurname = data_combined[thisRow,"Surname"];
thisTicket = data_combined[thisRow,"Ticket"];
thisTitle = as.character( data_combined[thisRow,"Title"] );
thisAge = data_combined[thisRow,"Age"];
if ( (thisTitle == "Master.") || (thisTitle == "Miss.") ){
# if this passenger is a "Master." or Miss. and numParch <= 2, he/she must be someone's child
numParents[thisRow] = numParch;
#next;
}
if (!is.na( thisAge ) && thisAge <= MAX_CHILD_AGE ){
# if this passenger is young, declare they cannot be parents, must be a child
numParents[thisRow] = numParch;
#next;
}
# get passenger rows on same ticket
sameTickets = data_combined$Ticket == thisTicket;
sameSurnames = data_combined$Surname == thisSurname;
sameSurnameSameTicket = sameTickets & sameSurnames & posParch;
ticketTitles = as.character( data_combined[ sameSurnameSameTicket, "Title"] );
ticketParches = data_combined[ sameSurnameSameTicket, "Parch"];
# now, look for passengers with the same surname on the ticket and check their titles and ages
numSame = length( sameSurnameSameTicket );
ages = sort( data_combined[sameSurnameSameTicket,"Age"] );
thisAgePos = which( thisAge == ages )[1]; # find index of thisAge in sorted ages
gaps = diff( ages );
numGenerationalGaps = length( which( gaps > MAX_CHILD_AGE) );
if ( numGenerationalGaps == 0 ){
if ( !is.na( thisAge )){
if ( ( thisAge >= 40 ) || (thisTitle == "Mrs." ) ){
numChildren[thisRow] = numParch;
}
else if (numParch > 2 ){
numChildren[thisRow] = numParch;
}
else{
numParents[thisRow] = numParch;
}
}
else{
# no age information. If travelling with "kids", and title isn't a kid, then parent
if ( ! ( thisTitle %in% c("Master.","Miss.") ) ){
if( any( c("Master.","Miss.") %in% ticketTitles ) ){
# now, if there are two Mr. in this group, we need to choose the real father
# If there are three or more children, then Parch will be greater than 2 and
# will indicate this is the father. Else, it will be a child
if( thisTitle == "Mr."){
maxParches = max( ticketParches );
if ( ( maxParches > 2 ) && ( maxParches == numParch ) ){
numChildren[thisRow] = numParch;
}
else{
numParents[thisRow] = numParch;
}
}
else{
numChildren[thisRow] = numParch;
}
}
else{
# all we have is a non Mr. or Miss. title. Make them a child
numChildren[thisRow] = numParch;
}
}
else{
numParents[thisRow] = numParch;
}
}
}
else{
if ( !is.na(thisAge) ){ # use age in comparison to generation gap to classify kids/parents
maxGapPos = which.max( gaps ) + 0.5; # the 0.5 puts it in the middle of the kids/parents
if ( thisAgePos < maxGapPos ){
numParents[thisRow] = numParch;
}
else{
numChildren[thisRow] = numParch;
}
}
else{
# no age info, have to go with titles
}
}
totalParch = numParents[thisRow] + numChildren[thisRow];
if ( totalParch != numParch ){
stop( "Number of Parents/Children assigned (", totalParch, ") does not equal the Parch variable (", numParch, ") for passenger ", data_combined[thisRow,"PassengerId"], "\n");
}
data_combined$NumParents = numParents;
data_combined$NumChildren = numChildren;
}
Similar to Parch, the SibSp variable indicates the number of Siblings AND/OR Spouses a passenger has on board. A better set of variables would be separate variables for Siblings and Spouses as it would be possible to model the original SibSp vector as a simple combination of the separate columns.
numSiblings = integer( nrow( data_combined ) );
numSpouses = integer( nrow( data_combined ) );
MAX_CHILD_AGE = 14;
posSibSp = data_combined$SibSp > 0;
idxSibSp = which( posSibSp );
counter = 0;
numCases = length( idxSibSp );
for (thisRow in idxSibSp){
counter = counter + 1;
numSibSp = data_combined[thisRow,"SibSp"];
numParch = data_combined[thisRow,"Parch"];
thisSurname = data_combined[thisRow,"Surname"];
thisTicket = data_combined[thisRow,"Ticket"];
thisTitle = as.character( data_combined[thisRow,"Title"] );
thisAge = data_combined[thisRow,"Age"];
thisSex = as.character( data_combined[thisRow,"Sex"] );
thisGivenName = as.character( data_combined[thisRow,"GivenName"] );
thisPassengerId = data_combined[thisRow,"PassengerId"];
if ( (thisTitle == "Master.") || (thisTitle == "Miss.") ){
# if this passenger is a "Master." or Miss., this passenger is not married, so must be spouse
numSiblings[thisRow] = numSibSp;
next;
}
if (!is.na( thisAge ) && thisAge <= 10 ){
# if this passenger is young, declare they cannot be parents, must be a child
numSiblings[thisRow] = numSibSp;
next;
}
# get passenger rows on same ticket
sameTickets = data_combined$Ticket == thisTicket;
sameSurnames = data_combined$Surname == thisSurname;
sameSurnameSameTicket = sameTickets & sameSurnames & posSibSp;
# if married, look for spouse on the same ticket
# get given names on same ticket
same_ticket_rows = data_combined[which(sameSurnameSameTicket),];
given_names = as.character( same_ticket_rows$GivenName );
titles = as.character( same_ticket_rows$Title );
sexes = as.character( same_ticket_rows$Sex );
passengerIds = same_ticket_rows$PassengerId;
master_mask = ( titles != "Master." );
sex_mask = ( sexes != thisSex ); # opposite sex marriage
miss_mask = ( titles != "Miss." );
this_mask = ( passengerIds != thisPassengerId );
idxMatch = which( ( thisGivenName == given_names ) & master_mask & sex_mask & miss_mask & this_mask);
if ( length( idxMatch ) == 1 ){
numSpouses[thisRow] = 1;
next;
}
ticketTitles = as.character( data_combined[ sameSurnameSameTicket, "Title"] );
ticketParches = data_combined[ sameSurnameSameTicket, "Parch"];
# now, look for passengers with the same surname on the ticket and check their titles and ages
numSame = length( sameSurnameSameTicket );
ages = sort( data_combined[sameSurnameSameTicket,"Age"] );
thisAgePos = which( thisAge == ages )[1]; # find index of thisAge in sorted ages
gaps = diff( ages );
numGenerationalGaps = length( which( gaps > MAX_CHILD_AGE) );
}
# do some manual corrections
numSiblings[168] = 0; # Mrs. William Skoog
numSpouses[168] = 1; # Mrs. William Skoog
numSiblings[361] = 0; # Mr. Wilhelm Skoog
numSpouses[361] = 1; # Mr. Wilhelm Skoog
numSiblings[679] = 0; # Mrs. Frederick Goodwin
numSpouses[679] = 1; # Mrs. Frederick Goodwin
numSiblings[1031] = 0; # Mr. Charles Frederick Goodwin
numSpouses[1031] = 1; # Mr. Charles Frederick Goodwin
numSiblings[557] = 0; # Lady. Duff Gordon
numSpouses[557] = 1; # Lady. Duff Gordon
numSiblings[600] = 0; # Sir. Duff Gordon
numSpouses[600] = 1; # Sir. Duff Gordon
numSiblings[746] = 0; # Capt. Edward Gifford Crosby
numSpouses[746] = 1; # Capt. Edward Gifford Crosby
numSiblings[1197] = 0; # Mrs. Edward Gifford Crosby
numSpouses[1197] = 1; # Mrs. Edward Gifford Crosby
numSiblings[1059] = 3; # Mr. Edward Watson Ford
numSpouses[1059] = 0; # Mr. Edward Watson Ford
counter = 0;
for (thisRow in idxSibSp){
counter = counter + 1;
numSibSp = data_combined[thisRow,"SibSp"];
numParch = data_combined[thisRow,"Parch"];
thisSurname = data_combined[thisRow,"Surname"];
thisTicket = data_combined[thisRow,"Ticket"];
thisTitle = as.character( data_combined[thisRow,"Title"] );
thisAge = data_combined[thisRow,"Age"];
thisSex = as.character( data_combined[thisRow,"Sex"] );
thisGivenName = as.character( data_combined[thisRow,"GivenName"] );
thisPassengerId = data_combined[thisRow,"PassengerId"];
numSiblings[thisRow] = numSibSp - numSpouses[thisRow];
}
data_combined$Spouses = numSpouses;
data_combined$SpousesFactor = factor( numSpouses );
data_combined$Siblings = numSiblings;
As determined in the EDA section, the Fare variable is not per passenger, it is the price paid for the ticket the passenger is traveling under (and multiple passengers may travel on the same ticket).
ticketChar = as.character( data_combined$Ticket );
uniqueTickets = unique( ticketChar );
farePerPassenger = data_combined$Fare;
for (i in 1:length(uniqueTickets) ){
thisTicket = uniqueTickets[i];
idxPeople = which( ticketChar == thisTicket );
theseFares = data_combined[idxPeople,"Fare"];
#if ( !all( theseFares == theseFares[1] ) ){
# cat( thisTicket )
#}
# should only be a single fare
numPeople = length( idxPeople );
farePerPassenger[idxPeople] = theseFares[1]/numPeople;
}
data_combined$FarePerPassenger = farePerPassenger;
This is a classification problem with two classes: { Died, Survived }. This is encoded in the ‘Survived’ variable in the train dataset.
## Of 891 passengers, 549 died and 342 survived.
## Mean Survivability = 0.3838384
The simplest model would be to assume that all test subjects are members of the most common class in the train dataset. In the train dataset, 62% of the subjects died, so this simple model would assume that all the test passengers die. This would yield a prediction accuracy of about 62%, and a corresponding misclassification rate of 38%, all false negatives.
This simple model is also the fastest to implement and serves as a good initial benchmark from which to improve upon.
Upon submission to the Kaggle site, the model yielded a score of 0.62679. In terms of the bias/variance tradeoff, this is a very biased model (very inflexible) and the variance will likely be very small (i.e., the choice of training set will not affect predictions on the test set except in rare cases where more survivors are selected than non-survivors).
Along with the train.csv and test.csv file, Kaggle provides an additional file called ‘gender_submission.csv’. This file is a sample submission to Kaggle in which all the female passengers are predicted to survive while all of the male passengers are predicted to die.
In the training set, of the 891 passengers, 342 are female. 233 of the 342 female passengers in the training set survived (74.2%). Of the 549 male passengers in the train set, 468 did not survive (81.1%). This yields an overall classification rate of 78.7% for the test set. On the bias/variance continuum, this model is very biased and should have small variance.
Upon submission to Kaggle, this model scored 0.76555.
The canonical model for a sinking ship is that women and children get preferential access to lifeboats while men are expected to defer. If 100% accordance to this model is assumed and all women and children are boarded onto lifeboats while the men are left to fight it out for flotsam in the frigid waters, women and children would be expected to survive while the men would perish.
Since the Age variable is missing for 177 of the train set cases and 86 test set cases, the imputed Age values from above will be used. For this application, a child will be defined as a person under 15 years of age.
## Warning: package 'caret' was built under R version 3.4.2
In the training set, the proportion of correct predictions is 0.79.
Upon submission to Kaggle, the Women and Children First model scored 0.76076.
In Linear Discriminant Analysis, the distributions of each of the predictors (Passenger Class, Age, Sex, etc.) are modeled separately for the Survived and Died classes. Then, Bayes’ theorem can be flipped around to get estimates for the probability of Survived or Died given the set of predictors. LDA assumes that the observations in each class (Survived, Died) are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix common to both classes. This assumption does not hold for some of the predictors in this analysis, as not only are they not all normal, some are categorical (Sex, Embarked). Nevertheless, the implementation in R converts the variables to a numeric albeit not-normal representation.
The first LDA model is one in which all predictors are considered: pClass, Sex, FixedAge, FarePerPassenger, FixedTitle, CabinAssignment, Embarked, NumPassengersOnTicket, NumParents, NumChildren, SpousesFactor, and Siblings. Against the training data, the model scored 0.8373 (~84% accuracy). When run against the test set, the model scored 0.78947.
library(MASS, quietly=TRUE);
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
xTrain = whiteTrain[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
xTest = whiteTest[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
lda.fit = lda( SurvivedFactor ~., data=xTrain);
lda.fit
## Call:
## lda(SurvivedFactor ~ ., data = xTrain)
##
## Prior probabilities of groups:
## 0 1
## 0.6161616 0.3838384
##
## Group means:
## pclass2 pclass3 Sexmale FixedAge FarePerPassenger
## 0 0.1766849 0.6775956 0.8524590 0.0586912 -0.2427148
## 1 0.2543860 0.3479532 0.3187135 -0.1214275 0.3509019
## FixedTitleMiss. FixedTitleMr. FixedTitleMrs. CabinAssignmentUnassigned
## 0 0.1001821 0.7941712 0.07468124 0.8761384
## 1 0.3713450 0.2368421 0.32456140 0.6023392
## EmbarkedQ EmbarkedS NumPassengersOnTicket NumParents NumChildren
## 0 0.0856102 0.7777778 -0.1662175 -0.04990379 -0.0429864357
## 1 0.0877193 0.6403509 0.2942616 0.16516206 -0.0002207613
## SpousesFactor1 Siblings
## 0 0.1220401 0.09339948
## 1 0.2076023 -0.07771321
##
## Coefficients of linear discriminants:
## LD1
## pclass2 -0.05937873
## pclass3 -0.63312860
## Sexmale -2.58719687
## FixedAge -0.17944828
## FarePerPassenger 0.14185791
## FixedTitleMiss. -2.59463695
## FixedTitleMr. -2.09196720
## FixedTitleMrs. -2.12049380
## CabinAssignmentUnassigned -0.51423840
## EmbarkedQ -0.02540707
## EmbarkedS -0.24815654
## NumPassengersOnTicket 0.06028440
## NumParents -0.02647490
## NumChildren -0.19587549
## SpousesFactor1 -0.20253519
## Siblings -0.33913849
plot( lda.fit );
lda.train.pred = predict( lda.fit, xTrain );
ldaTrainPredictions = as.numeric( lda.train.pred$class ) - 1;
truth = newTrain$Survived;
confusionMatrix( ldaTrainPredictions, truth );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 488 84
## 1 61 258
##
## Accuracy : 0.8373
## 95% CI : (0.8114, 0.8609)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6515
## Mcnemar's Test P-Value : 0.0677
##
## Sensitivity : 0.8889
## Specificity : 0.7544
## Pos Pred Value : 0.8531
## Neg Pred Value : 0.8088
## Prevalence : 0.6162
## Detection Rate : 0.5477
## Detection Prevalence : 0.6420
## Balanced Accuracy : 0.8216
##
## 'Positive' Class : 0
##
# scored a 0.8373 against the training data
lda.test.pred = predict( lda.fit, xTest );
# lda.pred$class is a factor; need to convert to 0, 1
ldaPredictions = as.numeric( lda.test.pred$class ) - 1;
ldaModelTrain = data.frame(PassengerId=train$PassengerId, Survived=ldaTrainPredictions);
ldaModel = data.frame(PassengerId=test$PassengerId, Survived=ldaPredictions);
write.csv( ldaModelTrain, "LDAModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( ldaModelTrain, "LDAModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( ldaModel, "LDAModel.csv", row.names=FALSE, quote=FALSE );
write.csv( ldaModel, "LDAModelProbs.csv", row.names=FALSE, quote=FALSE );
# scored a 0.78947
In subsequent analyses (logistic regression, random forests), the predictors are measured for effect on the response variable. The logistic regression analysis indicates which covariates have a low p-value (i.e., <0.05), indicating a significant effect. Similarly, random forests produce a variable importance measure which shows which covariates are used most often in tree generation. These measures give a good idea of which variables are likely to have the most predictive power. Using these variables and leaving out the variables with higher p-values and those not often used in random forests will likely reduce overfitting. The variables with the best predictive power in this set are pClass, Sex, FixedAge, FarePerPassenger, FixedTitle, CabinAssignment, NumChildren, SpousesFactor, and Siblings.
The LDA model using only significant features scored a 0.8339 against the training data and, similar to the full-featured LDA model, scored a 0.78947.
library(MASS);
xTrain = whiteTrain[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","NumChildren","SpousesFactor","Siblings")];
xTest = whiteTest[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","NumChildren","SpousesFactor","Siblings")];
lda.fit = lda( SurvivedFactor ~., data=xTrain);
plot( lda.fit );
lda.train.pred = predict( lda.fit, xTrain );
ldaTrainPredictions = as.numeric( lda.train.pred$class ) - 1;
truth = newTrain$Survived;
confusionMatrix( ldaTrainPredictions, truth );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 490 89
## 1 59 253
##
## Accuracy : 0.8339
## 95% CI : (0.8078, 0.8578)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6429
## Mcnemar's Test P-Value : 0.01714
##
## Sensitivity : 0.8925
## Specificity : 0.7398
## Pos Pred Value : 0.8463
## Neg Pred Value : 0.8109
## Prevalence : 0.6162
## Detection Rate : 0.5499
## Detection Prevalence : 0.6498
## Balanced Accuracy : 0.8161
##
## 'Positive' Class : 0
##
# scored a 0.8339 against the training data
lda.test.pred = predict( lda.fit, xTest );
# lda.pred$class is a factor; need to convert to 0, 1
ldaPredictions = as.numeric( lda.test.pred$class ) - 1;
ldaModelTrain = data.frame(PassengerId=train$PassengerId, Survived=ldaTrainPredictions);
ldaModel = data.frame(PassengerId=test$PassengerId, Survived=ldaPredictions);
write.csv( ldaModelTrain, "LDAReducedModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( ldaModelTrain, "LDAReducedModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( ldaModel, "LDAReducedModel.csv", row.names=FALSE, quote=FALSE );
write.csv( ldaModel, "LDAReducedModelProbs.csv", row.names=FALSE, quote=FALSE );
# scored a 0.78947
Quadratic Discriminant Analysis (QDA) is very much similar to LDA but with one difference: there is no assumption that the classes have a common variance. In this model, each class has its own covariance matrix. QDA is so named because the discriminant functions are quadratic in the predictor vector, as opposed to linear with LDA. Since the discriminant function is quadratic, the decision boundary is not constrained to a line as with LDA and can assume curved shapes.
The QDA model with all variables scored a 0.8159 against the training set and 0.77033 against the test data.
library(MASS);
xTrain = whiteTrain[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
xTest = whiteTest[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
qda.fit = qda( SurvivedFactor ~., data=xTrain);
qda.train.pred = predict( qda.fit, xTrain );
qdaTrainPredictions = as.numeric( qda.train.pred$class ) - 1;
truth = newTrain$Survived;
confusionMatrix( qdaTrainPredictions, truth );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 475 90
## 1 74 252
##
## Accuracy : 0.8159
## 95% CI : (0.7889, 0.8409)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6074
## Mcnemar's Test P-Value : 0.2415
##
## Sensitivity : 0.8652
## Specificity : 0.7368
## Pos Pred Value : 0.8407
## Neg Pred Value : 0.7730
## Prevalence : 0.6162
## Detection Rate : 0.5331
## Detection Prevalence : 0.6341
## Balanced Accuracy : 0.8010
##
## 'Positive' Class : 0
##
# scored a 0.8159 against the training data
qda.test.pred = predict( qda.fit, xTest );
# lda.pred$class is a factor; need to convert to 0, 1
qdaPredictions = as.numeric( qda.test.pred$class ) - 1;
qdaModelTrain = data.frame(PassengerId=train$PassengerId, Survived=qdaTrainPredictions);
qdaModel = data.frame(PassengerId=test$PassengerId, Survived=qdaPredictions);
write.csv( qdaModelTrain, "QDAModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( qdaModelTrain, "QDAModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( qdaModel, "QDAModel.csv", row.names=FALSE, quote=FALSE );
write.csv( qdaModel, "QDAModelProbs.csv", row.names=FALSE, quote=FALSE );
# scored a 0.77033
The QDA model using only significant features scored a 0.7845 against the training set and 0.78947 against the test data.
library(MASS);
xTrain = whiteTrain[,c("SurvivedFactor","pclass","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","NumChildren","Siblings")];
xTest = whiteTest[,c("SurvivedFactor","pclass","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","NumChildren","Siblings")];
qda.fit = qda( SurvivedFactor ~., data=xTrain);
qda.train.pred = predict( qda.fit, xTrain );
qdaTrainPredictions = as.numeric( qda.train.pred$class ) - 1;
truth = newTrain$Survived;
confusionMatrix( qdaTrainPredictions, truth );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 444 87
## 1 105 255
##
## Accuracy : 0.7845
## 95% CI : (0.756, 0.8111)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5489
## Mcnemar's Test P-Value : 0.2199
##
## Sensitivity : 0.8087
## Specificity : 0.7456
## Pos Pred Value : 0.8362
## Neg Pred Value : 0.7083
## Prevalence : 0.6162
## Detection Rate : 0.4983
## Detection Prevalence : 0.5960
## Balanced Accuracy : 0.7772
##
## 'Positive' Class : 0
##
# scored a 0.7845 against the training data
qda.test.pred = predict( qda.fit, xTest );
# lda.pred$class is a factor; need to convert to 0, 1
qdaPredictions = as.numeric( qda.test.pred$class ) - 1;
qdaModelTrain = data.frame(PassengerId=train$PassengerId, Survived=qdaTrainPredictions);
qdaModel = data.frame(PassengerId=test$PassengerId, Survived=qdaPredictions);
write.csv( qdaModelTrain, "QDASigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( qdaModelTrain, "QDASigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( qdaModel, "QDASigModel.csv", row.names=FALSE, quote=FALSE );
write.csv( qdaModel, "QDASigModelProbs.csv", row.names=FALSE, quote=FALSE );
# scored a 0.78947
In a linear regression model, the probability of the response variable is modeled as a linear combination of predictors and multiplicative coefficients. The encoding of the response variable (Survived) as 0 and 1 allows for a linear regression model but covariates that produce a response greater than 1 and less than 0 are considered errors and can negatively impact parameter estimation. In logistic regression, the probability of the response variable is modeled with the logistic function, or y = eX/(1+eX). Logistic regression is a classic model for binary classification.
A logistic regression model is fit using the train data. Selected variables are shown below. Some variables could be treated as either numeric or factors (i.e., NumPassengersOnTicket and NumChildren). Because there was a natural ordering of these variables, they are treated as numeric values rather than factors. This was done to capture the linear dependence of each variable to the survival response.
all_features = c("pclass", "Sex", "FixedAge", "FarePerPassenger", "FixedTitle", "Siblings", "CabinAssignment", "Embarked", "NumPassengersOnTicket", "NumParents", "NumChildren", "SpousesFactor");
full_df = newTrain[, c("Survived", all_features)];
full.glm.fit = glm( Survived~., data=full_df, family=binomial);
summary( full.glm.fit );
##
## Call:
## glm(formula = Survived ~ ., family = binomial, data = full_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4086 -0.5641 -0.3761 0.5328 2.6498
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.50564 1.06421 7.053 1.75e-12 ***
## pclass2 -0.28984 0.47212 -0.614 0.53928
## pclass3 -1.25993 0.49740 -2.533 0.01131 *
## Sexmale -3.60602 0.65073 -5.541 3.00e-08 ***
## FixedAge -0.02796 0.01022 -2.736 0.00622 **
## FarePerPassenger 0.01514 0.01310 1.156 0.24775
## FixedTitleMiss. -4.08170 0.84886 -4.808 1.52e-06 ***
## FixedTitleMr. -3.34786 0.58437 -5.729 1.01e-08 ***
## FixedTitleMrs. -3.20474 0.82194 -3.899 9.66e-05 ***
## Siblings -0.71829 0.17065 -4.209 2.56e-05 ***
## CabinAssignmentUnassigned -0.80448 0.36274 -2.218 0.02657 *
## EmbarkedQ -0.06568 0.40009 -0.164 0.86961
## EmbarkedS -0.41933 0.25383 -1.652 0.09852 .
## NumPassengersOnTicket 0.10632 0.09815 1.083 0.27875
## NumParents -0.23028 0.26983 -0.853 0.39341
## NumChildren -0.48163 0.18715 -2.574 0.01007 *
## SpousesFactor1 -0.39542 0.30922 -1.279 0.20097
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 714.88 on 874 degrees of freedom
## AIC: 748.88
##
## Number of Fisher Scoring iterations: 5
trainSetProbs = predict( full.glm.fit, newTrain, type="response"); # now a vector of probabilities
# probabilities >= 0.5 mean survived, < 0.5 mean perished
# training set performance
threshold = 0.5;
predictions = as.numeric( trainSetProbs >= threshold );
confusionMatrix( predictions, newTrain$Survived );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 485 84
## 1 64 258
##
## Accuracy : 0.8339
## 95% CI : (0.8078, 0.8578)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6449
## Mcnemar's Test P-Value : 0.1183
##
## Sensitivity : 0.8834
## Specificity : 0.7544
## Pos Pred Value : 0.8524
## Neg Pred Value : 0.8012
## Prevalence : 0.6162
## Detection Rate : 0.5443
## Detection Prevalence : 0.6386
## Balanced Accuracy : 0.8189
##
## 'Positive' Class : 0
##
# now, how'd we do against the test set?
testSetProbs = predict( full.glm.fit, newTest, type="response");
fullLogisticPredictions = as.numeric( testSetProbs >= threshold );
fullLogisticRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=predictions);
fullLogisticRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=fullLogisticPredictions);
fullLogisticRegressionModelTrainProbs = data.frame(PassengerId=train$PassengerId, Survived=trainSetProbs);
fullLogisticRegressionModelProbs = data.frame(PassengerId=test$PassengerId, Survived=testSetProbs);
write.csv( fullLogisticRegressionModelTrainProbs, "LogisticRegressionFullModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullLogisticRegressionModelProbs, "LogisticRegressionFullModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullLogisticRegressionModelTrain, "LogisticRegressionFullModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( fullLogisticRegressionModel, "LogisticRegressionFullModel.csv", row.names=FALSE, quote=FALSE );
# scored a 0.78947 - not expected
The logistic regression model was first fit with all 12 possible covariates. After fitting the model to the training data, the model was run on the training set yielding a training set test error of 0.1661055 (0.8338945 accuracy rate). The model had an overall accuracy of 0.78947 on the test set (the score from Kaggle). Since this was lower than the expected score (the training set yielded a score of 0.8338945), it appears the model may be overfitting the training data.
library(caret);
library(lattice);
cat( "Training Set performance:\n");
## Training Set performance:
confusionMatrix( predictions, newTrain$Survived);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 485 84
## 1 64 258
##
## Accuracy : 0.8339
## 95% CI : (0.8078, 0.8578)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6449
## Mcnemar's Test P-Value : 0.1183
##
## Sensitivity : 0.8834
## Specificity : 0.7544
## Pos Pred Value : 0.8524
## Neg Pred Value : 0.8012
## Prevalence : 0.6162
## Detection Rate : 0.5443
## Detection Prevalence : 0.6386
## Balanced Accuracy : 0.8189
##
## 'Positive' Class : 0
##
Because the logistic regression model with all 12 predictors has a test set error somewhat larger than the training set error, the model is likely fitting noise in the training set. It’s very likely that some of the features are co-linear, for instance, Passenger Class (pclass) and FarePerPassenger. If the number of predictors can be reduced, the variance of the model can likely be lowered. After running the full set of predictors, the model shows several variables as significant: FixedTitle, Sex, and Siblings. Other variables that show nearly significant correlation to the response are FixedAge, pclass, CabinAssignment, and NumChildren. If those variables are brought forward and tested separately in different combinations, a model with a lower cross validation error might result. As a final measure, a separate model was run with the remaining “leftover” features: FarePerPassenger, Embarked, NumPassengersOnTicket, SpousesFactor, and NumParents. The idea is to see if any significant features can be determined in the absence of the significant features of the full model.
leftoverFeatures = c("FarePerPassenger","Embarked","NumPassengersOnTicket","SpousesFactor","NumParents");
leftover_df = newTrain[, c("Survived", leftoverFeatures)];
leftover.glm.fit = glm( Survived ~., data = leftover_df, family=binomial);
summary(leftover.glm.fit);
##
## Call:
## glm(formula = Survived ~ ., family = binomial, data = leftover_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6601 -0.8987 -0.7557 1.1841 1.9004
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.955763 0.234702 -4.072 4.66e-05 ***
## FarePerPassenger 0.048603 0.007031 6.912 4.77e-12 ***
## EmbarkedQ 0.095578 0.301596 0.317 0.751314
## EmbarkedS -0.483531 0.194625 -2.484 0.012976 *
## NumPassengersOnTicket -0.046748 0.053266 -0.878 0.380145
## SpousesFactor1 0.520507 0.213264 2.441 0.014660 *
## NumParents 0.594420 0.173922 3.418 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 1080.1 on 884 degrees of freedom
## AIC: 1094.1
##
## Number of Fisher Scoring iterations: 4
The logistic regression model with the “leftover” features yielded two new feature possibilities: FarePerPassenger and NumParents. These features are likely proxies for passenger class (pclass) and age (FixedAge). i.e., passengers travelling with their parents are more likely to be children and thus, have a higher survival probability. Similarly, FarePerPassenger is likely to reflect the passenger class separation. However, unlike pclass, FarePerPassenger is a continuous variable and possibly more informative than the three passenger classes. So, combined with the significant features from the full logistic regression model, a set of features can be drawn upon in different combinations in order to select the model with the lowest cross validation error.
library(boot, quietly = TRUE);
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
significant_features = c("FixedTitle", "Sex", "Siblings", "FixedAge", "CabinAssignment", "pclass", "NumChildren", "FarePerPassenger", "Embarked");
all_combinations_of_significant_features = lapply(1:length(significant_features), function(x) combn(length(significant_features),x));
numFeatureSetClasses = length( all_combinations_of_significant_features );
correct = list();
accuracies = list();
errors = list();
features = list();
numFeatures = list();
counter = 1;
for ( featureClass in 1:numFeatureSetClasses){
featClassColumns = dim( all_combinations_of_significant_features[[featureClass]] )[2];
for ( thisCol in 1:featClassColumns ){
feature_set = significant_features[ all_combinations_of_significant_features[[featureClass]][,thisCol]];
nFeatures = length( feature_set );
df = newTrain[, c("Survived", feature_set)]
glm.fit = glm( Survived~.,data=df,family=binomial);
cv.glm.fit = cv.glm(df,glm.fit,K=10);
cv.error = cv.glm.fit$delta[1];
trainSetProbs = predict( glm.fit, newTrain, type="response"); # now a vector of probabilities
predictions = as.numeric( trainSetProbs >= threshold );
numRight = sum( predictions == newTrain$Survived );
accuracy = mean( predictions == newTrain$Survived );
features[counter] = list( feature_set );
numFeatures[counter] = nFeatures;
accuracies[counter] = accuracy;
correct[counter] = numRight;
errors[counter] = cv.error;
counter = counter + 1;
}
}
cvResults = data.frame( NumFeatures=unlist(numFeatures), CVError=unlist(errors), TrainingScore=unlist(accuracies), NumCorrect=unlist(correct), Features=I(features) );
cvResults = cvResults[order(cvResults$CVError),];
knitr::kable( head(cvResults,n=20L), format="markdown", longtable=TRUE)
| NumFeatures | CVError | TrainingScore | NumCorrect | Features | |
|---|---|---|---|---|---|
| 466 | 7 | 0.1278208 | 0.8305275 | 740 | FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, NumChildren |
| 503 | 8 | 0.1284761 | 0.8395062 | 748 | FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, NumChildren, Embarked |
| 502 | 8 | 0.1284869 | 0.8282828 | 738 | FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, NumChildren, FarePerPassenger |
| 473 | 7 | 0.1287440 | 0.8350168 | 744 | FixedTitle, Sex, Siblings, FixedAge, pclass, NumChildren, Embarked |
| 467 | 7 | 0.1288112 | 0.8260382 | 736 | FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger |
| 511 | 9 | 0.1291203 | 0.8350168 | 744 | FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, NumChildren, FarePerPassenger, Embarked |
| 468 | 7 | 0.1292210 | 0.8338945 | 743 | FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, Embarked |
| 382 | 6 | 0.1294414 | 0.8271605 | 737 | FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass |
| 386 | 6 | 0.1294508 | 0.8249158 | 735 | FixedTitle, Sex, Siblings, FixedAge, pclass, NumChildren |
| 257 | 5 | 0.1297716 | 0.8249158 | 735 | FixedTitle, Sex, Siblings, FixedAge, pclass |
| 388 | 6 | 0.1299351 | 0.8271605 | 737 | FixedTitle, Sex, Siblings, FixedAge, pclass, Embarked |
| 506 | 8 | 0.1300108 | 0.8327722 | 742 | FixedTitle, Sex, Siblings, FixedAge, pclass, NumChildren, FarePerPassenger, Embarked |
| 507 | 8 | 0.1300126 | 0.8260382 | 736 | FixedTitle, Sex, Siblings, CabinAssignment, pclass, NumChildren, FarePerPassenger, Embarked |
| 392 | 6 | 0.1300478 | 0.8282828 | 738 | FixedTitle, Sex, Siblings, CabinAssignment, pclass, NumChildren |
| 477 | 7 | 0.1300765 | 0.8316498 | 741 | FixedTitle, Sex, Siblings, CabinAssignment, pclass, NumChildren, Embarked |
| 504 | 8 | 0.1301593 | 0.8294052 | 739 | FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger, Embarked |
| 474 | 7 | 0.1303197 | 0.8271605 | 737 | FixedTitle, Sex, Siblings, FixedAge, pclass, FarePerPassenger, Embarked |
| 476 | 7 | 0.1306408 | 0.8249158 | 735 | FixedTitle, Sex, Siblings, CabinAssignment, pclass, NumChildren, FarePerPassenger |
| 472 | 7 | 0.1309137 | 0.8204265 | 731 | FixedTitle, Sex, Siblings, FixedAge, pclass, NumChildren, FarePerPassenger |
| 265 | 5 | 0.1310824 | 0.8282828 | 738 | FixedTitle, Sex, Siblings, pclass, NumChildren |
Of the 512 different feature combinations, the top 20 are listed. The cross validation error estimates were made using 10-fold cross validation. The top 10 are retested using LOOCV:
correct = list();
accuracies = list();
errors = list();
features = list();
numFeatures = list();
counter = 1;
for ( rowNum in 502:511){
feature_set = cvResults$Features[[rowNum]];
nFeatures = cvResults$NumFeatures[rowNum];
df = newTrain[, c("Survived", feature_set)]
glm.fit = glm( Survived~.,data=df,family=binomial);
cv.glm.fit = cv.glm(df,glm.fit);
cv.error = cv.glm.fit$delta[1];
trainSetProbs = predict( glm.fit, newTrain, type="response"); # now a vector of probabilities
predictions = as.numeric( trainSetProbs >= threshold );
numRight = sum( predictions == newTrain$Survived );
accuracy = mean( predictions == newTrain$Survived );
features[counter] = list( feature_set );
numFeatures[counter] = nFeatures;
accuracies[counter] = accuracy;
correct[counter] = numRight;
errors[counter] = cv.error;
counter = counter + 1;
}
loocvResults = data.frame( NumFeatures=unlist(numFeatures), CVError=unlist(errors), TrainingScore=unlist(accuracies), NumCorrect=unlist(correct), Features=I(features) );
loocvResults = loocvResults[order(loocvResults$CVError),];
knitr::kable( head(loocvResults,n=10L), format="markdown", longtable=TRUE)
| NumFeatures | CVError | TrainingScore | NumCorrect | Features | |
|---|---|---|---|---|---|
| 2 | 1 | 0.2312228 | 0.6363636 | 567 | Embarked |
| 3 | 3 | 0.2312587 | 0.6318743 | 563 | Siblings, NumChildren, Embarked |
| 1 | 2 | 0.2316473 | 0.6363636 | 567 | NumChildren, Embarked |
| 4 | 2 | 0.2327469 | 0.6285073 | 560 | Siblings, FixedAge |
| 5 | 3 | 0.2329559 | 0.6262626 | 558 | Siblings, FixedAge, NumChildren |
| 7 | 1 | 0.2356697 | 0.6161616 | 549 | FixedAge |
| 6 | 2 | 0.2358816 | 0.6150393 | 548 | FixedAge, NumChildren |
| 8 | 1 | 0.2361338 | 0.6161616 | 549 | Siblings |
| 10 | 2 | 0.2366439 | 0.6161616 | 549 | Siblings, NumChildren |
| 9 | 1 | 0.2374988 | 0.6161616 | 549 | NumChildren |
The model with the lowest cross validation error is a model with 8 features: FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, NumChildren, and Embarked. A logistic regression model is fit using these features and submitted to Kaggle.
best_features = c("pclass", "Sex", "FixedAge", "FixedTitle", "Siblings", "CabinAssignment", "Embarked", "NumChildren" );
best_df = newTrain[, c("Survived", best_features)];
best.glm.fit = glm( Survived~., data=best_df, family=binomial);
trainSetProbs = predict( best.glm.fit, newTrain, type="response"); # now a vector of probabilities
# probabilities >= 0.5 mean survived, < 0.5 mean perished
threshold = 0.5;
predictions = as.numeric( trainSetProbs >= threshold );
summary( best.glm.fit );
##
## Call:
## glm(formula = Survived ~ ., family = binomial, data = best_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3843 -0.5688 -0.3754 0.5284 2.6770
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.956973 0.916660 8.680 < 2e-16 ***
## pclass2 -0.579907 0.398655 -1.455 0.14576
## pclass3 -1.594762 0.391739 -4.071 4.68e-05 ***
## Sexmale -3.547781 0.639796 -5.545 2.94e-08 ***
## FixedAge -0.026580 0.009689 -2.743 0.00608 **
## FixedTitleMiss. -3.980327 0.840060 -4.738 2.16e-06 ***
## FixedTitleMr. -3.300974 0.573629 -5.755 8.69e-09 ***
## FixedTitleMrs. -3.235773 0.813646 -3.977 6.98e-05 ***
## Siblings -0.655098 0.134485 -4.871 1.11e-06 ***
## CabinAssignmentUnassigned -0.787387 0.353103 -2.230 0.02575 *
## EmbarkedQ -0.113587 0.392621 -0.289 0.77235
## EmbarkedS -0.485940 0.246626 -1.970 0.04880 *
## NumChildren -0.397312 0.159939 -2.484 0.01299 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 719.17 on 878 degrees of freedom
## AIC: 745.17
##
## Number of Fisher Scoring iterations: 5
# How'd we do?
# now calculate the training set error
correctRate = mean( predictions == newTrain$Survived );
# scored a 0.8395062
# now, how'd we do against the test set?
testSetProbs = predict( best.glm.fit, newTest, type="response");
bestTestPredictions = as.numeric( testSetProbs >= threshold );
bestLogisticRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=predictions);
bestLogisticRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=bestTestPredictions);
bestLogisticRegressionModelTrainProbs = data.frame(PassengerId=train$PassengerId, Survived=trainSetProbs);
bestLogisticRegressionModelProbs = data.frame(PassengerId=test$PassengerId, Survived=testSetProbs);
write.csv( bestLogisticRegressionModelTrainProbs, "LogisticRegressionSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bestLogisticRegressionModelProbs, "LogisticRegressionSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bestLogisticRegressionModelTrain, "LogisticRegressionSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( bestLogisticRegressionModel, "LogisticRegressionSigModel.csv", row.names=FALSE, quote=FALSE );
# scored a 0.77990 - not as good as the full model
The model based on significant features scored lower on the test set than the full model. To reduce the model fit to noise in the training data, the regularization methods of ridge regression and lasso are considered.
Ridge regression introduces a coefficient penalty that reduces the effect of the covariates. This can help with variance by making the solution slightly less flexible. The glmnet package includes a method for determining the lambda penalty coefficient using cross validation (default is 10-fold cross validation).
library( glmnet, quietly = TRUE );
## Loaded glmnet 2.0-13
lambdas = c();
set.seed(1);
# use ridge regression
xTrain = model.matrix( Survived~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain,family=binomial);
xTest = model.matrix( ~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTest,family=binomial);
#xTrain = model.matrix( Survived~.,data=full_df,family=binomial);
full_df_test = newTest[, all_features];
#xTest = model.matrix( ~.,data=full_df_test,family=binomial);
cv.out = cv.glmnet( xTrain, newTrain$Survived, alpha=0, type.measure="class");
plot( cv.out );
lambdaMin = cv.out$lambda.min;
rrmodel = glmnet( xTrain, newTrain$Survived, alpha=0);
train.ridge.pred = predict( rrmodel, s=lambdaMin, newx=xTrain);
trainRidgePredictions = as.numeric( train.ridge.pred >= threshold );
numCorrect = sum(trainRidgePredictions == newTrain$Survived);
accuracy = numCorrect/nrow(newTrain);
ridge.pred = predict( rrmodel, s=lambdaMin, newx=xTest);
fullRidgePredictions = as.numeric( ridge.pred >= threshold );
fullRidgeRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=trainRidgePredictions);
fullRidgeRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=fullRidgePredictions);
fullRidgeRegressionModelTrainProbs = data.frame(PassengerId=train$PassengerId, Survived=train.ridge.pred);
fullRidgeRegressionModelProbs = data.frame(PassengerId=test$PassengerId, Survived=ridge.pred);
write.csv( fullRidgeRegressionModelTrainProbs, "RidgeRegressionFullModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullRidgeRegressionModelProbs, "RidgeRegressionFullModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullRidgeRegressionModelTrain, "RidgeRegressionFullModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( fullRidgeRegressionModel, "RidgeRegressionFullModel.csv", row.names=FALSE, quote=FALSE );
# score is 0.78947
The training set mean squared error is the smallest at a lambda value of 0.02931267. Against the training data, the model achieved an accuracy of 0.8226712.
confusionMatrix(trainRidgePredictions, newTrain$Survived);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 482 91
## 1 67 251
##
## Accuracy : 0.8227
## 95% CI : (0.796, 0.8472)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6201
## Mcnemar's Test P-Value : 0.06728
##
## Sensitivity : 0.8780
## Specificity : 0.7339
## Pos Pred Value : 0.8412
## Neg Pred Value : 0.7893
## Prevalence : 0.6162
## Detection Rate : 0.5410
## Detection Prevalence : 0.6431
## Balanced Accuracy : 0.8059
##
## 'Positive' Class : 0
##
When comparing to the full logistic regression solution, the ridge predictions are the same as the least-squares logistic regression in 398 of the 418 cases (95.2% of the test set). The model scored a 0.78468 upon submission to the Kaggle site - slightly less than the full logistic regression solution (see comparison below).
fullLogisticPredictions = bestTestPredictions;
confusionMatrix( fullRidgePredictions, fullLogisticPredictions );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 245 14
## 1 6 153
##
## Accuracy : 0.9522
## 95% CI : (0.9271, 0.9705)
## No Information Rate : 0.6005
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8995
## Mcnemar's Test P-Value : 0.1175
##
## Sensitivity : 0.9761
## Specificity : 0.9162
## Pos Pred Value : 0.9459
## Neg Pred Value : 0.9623
## Prevalence : 0.6005
## Detection Rate : 0.5861
## Detection Prevalence : 0.6196
## Balanced Accuracy : 0.9461
##
## 'Positive' Class : 0
##
Next, a ridge regression model is fit using only the significant features determined from the logistic regression solution. From the training data, lambda was determined to be 0.02931267. Against the training data, the model achieved an accuracy of 0.8249158.
library( glmnet );
lambdas = c();
set.seed(1);
# use ridge regression
xTrain = model.matrix( Survived~.,data=best_df,family=binomial);
best_df_test = newTest[, best_features];
xTest = model.matrix( ~.,data=best_df_test,family=binomial);
cv.out = cv.glmnet( xTrain, newTrain$Survived, alpha=0, type.measure="class");
plot( cv.out );
lambdaMin = cv.out$lambda.min;
rrmodel = glmnet( xTrain, newTrain$Survived, alpha=0);
train.ridge.pred = predict( rrmodel, s=lambdaMin, newx=xTrain);
trainRidgePredictions = as.numeric( train.ridge.pred >= threshold );
numCorrect = sum(trainRidgePredictions == newTrain$Survived);
accuracy = numCorrect/nrow(newTrain);
ridge.pred = predict( rrmodel, s=lambdaMin, newx=xTest);
bestRidgePredictions = as.numeric( ridge.pred >= threshold );
bestRidgeRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=trainRidgePredictions);
bestRidgeRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=bestRidgePredictions);
bestRidgeRegressionModelTrainProbs = data.frame(PassengerId=train$PassengerId, Survived=as.numeric(train.ridge.pred));
bestRidgeRegressionModelProbs = data.frame(PassengerId=test$PassengerId, Survived=as.numeric(ridge.pred));
write.csv( bestRidgeRegressionModelTrainProbs, "RidgeRegressionSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bestRidgeRegressionModelProbs, "RidgeRegressionSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bestRidgeRegressionModelTrain, "RidgeRegressionSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( bestRidgeRegressionModel, "RidgeRegressionSigModel.csv", row.names=FALSE, quote=FALSE );
When comparing to the full logistic regression solution, the best feature ridge predictions are the same as the least-squares logistic regression in 399 of the 418 cases (95.4% of the test set). The model scored a 0.77990 upon submission to the Kaggle site - which is less than both the full logistic regression and full ridge regression models.
fullLogisticPredictions = bestTestPredictions;
confusionMatrix( bestRidgePredictions, fullLogisticPredictions );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 244 14
## 1 7 153
##
## Accuracy : 0.9498
## 95% CI : (0.9242, 0.9686)
## No Information Rate : 0.6005
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8946
## Mcnemar's Test P-Value : 0.1904
##
## Sensitivity : 0.9721
## Specificity : 0.9162
## Pos Pred Value : 0.9457
## Neg Pred Value : 0.9563
## Prevalence : 0.6005
## Detection Rate : 0.5837
## Detection Prevalence : 0.6172
## Balanced Accuracy : 0.9441
##
## 'Positive' Class : 0
##
Similar to ridge regression, lasso regularization introduces a penalty to reduce the magnitude of coefficient estimates. As with ridge regression, the glmnet implementation of the lasso uses cross validation to estimate the best value of lambda. The Lasso regularization method multiplies lambda with the absolute value of the coefficients as the penalty and, unlike Ridge Regression, can be used to eliminate features altogether.
# on to the lasso
xTrain = model.matrix( Survived~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain,family=binomial);
xTest = model.matrix( ~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTest,family=binomial);
cv.out = cv.glmnet( xTrain, newTrain$Survived, alpha=1, type.measure = "class");
plot( cv.out );
lambdaMin = cv.out$lambda.min;
lassomodel = glmnet( xTrain, newTrain$Survived, alpha=1);
train.lasso.pred = predict( lassomodel, s=lambdaMin, newx=xTrain);
trainLassoPredictions = as.numeric( train.lasso.pred >= threshold );
numCorrect = sum(trainLassoPredictions == newTrain$Survived);
accuracy = numCorrect/nrow(newTrain);
lasso.pred = predict( lassomodel, s=lambdaMin, newx=xTest);
lassoPredictions = as.numeric( lasso.pred >= threshold );
fullLassoRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=trainLassoPredictions);
fullLassoRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=lassoPredictions);
fullLassoRegressionModelTrainProbs = data.frame(PassengerId=train$PassengerId, Survived=as.numeric(train.lasso.pred));
fullLassoRegressionModelProbs = data.frame(PassengerId=test$PassengerId, Survived=as.numeric(lasso.pred));
write.csv( fullLassoRegressionModelTrainProbs, "LassoRegressionFullModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullLassoRegressionModelProbs, "LassoRegressionFullModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullLassoRegressionModelTrain, "LassoRegressionFullModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( fullLassoRegressionModel, "LassoRegressionFullModel.csv", row.names=FALSE, quote=FALSE );
In the plot, as Lambda increases, the number of features in the model is reduced (see top axis). In this application, no features were eliminated. The Lasso model using all of the features had a training set error of 0.1638608 with a lambda of 0.00160114 (see confusion matrix below).
confusionMatrix( trainLassoPredictions, newTrain$Survived );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 489 86
## 1 60 256
##
## Accuracy : 0.8361
## 95% CI : (0.8102, 0.8599)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6485
## Mcnemar's Test P-Value : 0.03854
##
## Sensitivity : 0.8907
## Specificity : 0.7485
## Pos Pred Value : 0.8504
## Neg Pred Value : 0.8101
## Prevalence : 0.6162
## Detection Rate : 0.5488
## Detection Prevalence : 0.6453
## Balanced Accuracy : 0.8196
##
## 'Positive' Class : 0
##
When comparing to the full logistic regression model against the test data, the lasso model differed from the full logistic regression model in 11 cases, predicting the same outcome in 407 of 419 rows. Against the Kaggle test data, the model scored 0.78947.
confusionMatrix( lassoPredictions, fullLogisticPredictions );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 247 6
## 1 4 161
##
## Accuracy : 0.9761
## 95% CI : (0.9564, 0.9885)
## No Information Rate : 0.6005
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.95
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.9841
## Specificity : 0.9641
## Pos Pred Value : 0.9763
## Neg Pred Value : 0.9758
## Prevalence : 0.6005
## Detection Rate : 0.5909
## Detection Prevalence : 0.6053
## Balanced Accuracy : 0.9741
##
## 'Positive' Class : 0
##
The lasso model with only the significant features achieved a training set error of 0.1627385 using a lambda of 0.0002067951.
# on to the lasso
xTrain = model.matrix( Survived~.,data=best_df,family=binomial);
best_df_test = newTest[, best_features];
xTest = model.matrix( ~.,data=best_df_test,family=binomial);
cv.out = cv.glmnet( xTrain, newTrain$Survived, alpha=1, type.measure = "class");
plot( cv.out );
lambdaMin = cv.out$lambda.min;
lassomodel = glmnet( xTrain, newTrain$Survived, alpha=1);
train.lasso.pred = predict( lassomodel, s=lambdaMin, newx=xTrain);
trainLassoPredictions = as.numeric( train.lasso.pred >= threshold );
numCorrect = sum(trainLassoPredictions == newTrain$Survived);
accuracy = numCorrect/nrow(newTrain);
lasso.pred = predict( lassomodel, s=lambdaMin, newx=xTest);
lassoPredictions = as.numeric( lasso.pred >= threshold );
bestLassoRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=trainLassoPredictions);
bestLassoRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=lassoPredictions);
bestLassoRegressionModelTrainProbs = data.frame(PassengerId=train$PassengerId, Survived=as.numeric(train.lasso.pred));
bestLassoRegressionModelProbs = data.frame(PassengerId=test$PassengerId, Survived=as.numeric(lasso.pred));
write.csv( bestLassoRegressionModelTrainProbs, "LassoRegressionSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bestLassoRegressionModelProbs, "LassoRegressionSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bestLassoRegressionModelTrain, "LassoRegressionSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( bestLassoRegressionModel, "LassoRegressionSigModel.csv", row.names=FALSE, quote=FALSE );
The “best” feature Lasso model had a training set error of 0.1616162 with a lambda of 0.0005754197 (see confusion matrix below).
confusionMatrix( trainLassoPredictions, newTrain$Survived );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 490 86
## 1 59 256
##
## Accuracy : 0.8373
## 95% CI : (0.8114, 0.8609)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6508
## Mcnemar's Test P-Value : 0.03084
##
## Sensitivity : 0.8925
## Specificity : 0.7485
## Pos Pred Value : 0.8507
## Neg Pred Value : 0.8127
## Prevalence : 0.6162
## Detection Rate : 0.5499
## Detection Prevalence : 0.6465
## Balanced Accuracy : 0.8205
##
## 'Positive' Class : 0
##
When comparing to the full logistic regression model against the test data, the “best feature” lasso model differed from the full logistic regression model in 11 cases, predicting the same outcome in 407 of 419 rows. Against the Kaggle test data, the model scored 0.79425.
confusionMatrix( lassoPredictions, fullLogisticPredictions );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 248 8
## 1 3 159
##
## Accuracy : 0.9737
## 95% CI : (0.9534, 0.9868)
## No Information Rate : 0.6005
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9449
## Mcnemar's Test P-Value : 0.2278
##
## Sensitivity : 0.9880
## Specificity : 0.9521
## Pos Pred Value : 0.9688
## Neg Pred Value : 0.9815
## Prevalence : 0.6005
## Detection Rate : 0.5933
## Detection Prevalence : 0.6124
## Balanced Accuracy : 0.9701
##
## 'Positive' Class : 0
##
The best feature lasso model differed from the full logistic regression model in 11 cases and from the best feature ridge regression model in 12 cases (see confusion matrices below).
confusionMatrix( lassoPredictions, fullLogisticPredictions );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 248 8
## 1 3 159
##
## Accuracy : 0.9737
## 95% CI : (0.9534, 0.9868)
## No Information Rate : 0.6005
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9449
## Mcnemar's Test P-Value : 0.2278
##
## Sensitivity : 0.9880
## Specificity : 0.9521
## Pos Pred Value : 0.9688
## Neg Pred Value : 0.9815
## Prevalence : 0.6005
## Detection Rate : 0.5933
## Detection Prevalence : 0.6124
## Balanced Accuracy : 0.9701
##
## 'Positive' Class : 0
##
confusionMatrix( bestRidgePredictions, lassoPredictions );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 251 7
## 1 5 155
##
## Accuracy : 0.9713
## 95% CI : (0.9504, 0.9851)
## No Information Rate : 0.6124
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9394
## Mcnemar's Test P-Value : 0.7728
##
## Sensitivity : 0.9805
## Specificity : 0.9568
## Pos Pred Value : 0.9729
## Neg Pred Value : 0.9688
## Prevalence : 0.6124
## Detection Rate : 0.6005
## Detection Prevalence : 0.6172
## Balanced Accuracy : 0.9686
##
## 'Positive' Class : 0
##
# score is 0.78947
K-Nearest Neighbors is a conceptionally simple method of assigning the most common class label to a point based on the labels of the closest K points. The lower the K, the higher the variance (at an extreme, K=1 will assign the class label based on the closest labeled point). Conversely, the higher the K, the lower the variance and higher the bias. With K as N-1 as an extreme, each point will be labeled as the most common class. This method lends itself nicely to cross-validation to find the best value of K.
Using Leave One Out Cross Validation, the KNN model was trained using all features. The model was cross-validated using 20 different K-values from 5 to 43, odd numbered. The model achieved a training set accuracy of 0.8563 with a K of 9.
library(caret);
trainCtrl = trainControl( method="LOOCV");
knn_fit = train( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_fit
## k-Nearest Neighbors
##
## 891 samples
## 12 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (16), scaled (16)
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 890, 890, 890, 890, 890, 890, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8193042 0.6152323
## 7 0.8204265 0.6157019
## 9 0.8249158 0.6248907
## 11 0.8136925 0.5990534
## 13 0.8125701 0.5950474
## 15 0.8114478 0.5937668
## 17 0.8080808 0.5844114
## 19 0.8002245 0.5661743
## 21 0.8148148 0.5971739
## 23 0.8204265 0.6087150
## 25 0.8215488 0.6104966
## 27 0.8215488 0.6104966
## 29 0.8114478 0.5896171
## 31 0.8092031 0.5837867
## 33 0.8125701 0.5899655
## 35 0.8114478 0.5877454
## 37 0.8103255 0.5850549
## 39 0.8136925 0.5917222
## 41 0.8103255 0.5836299
## 43 0.8069585 0.5769651
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
plot( knn_fit );
test_pred = predict( knn_fit, newdata=newTrain);
confusionMatrix( test_pred, newTrain$SurvivedFactor)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 495 74
## 1 54 268
##
## Accuracy : 0.8563
## 95% CI : (0.8316, 0.8787)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6929
## Mcnemar's Test P-Value : 0.09308
##
## Sensitivity : 0.9016
## Specificity : 0.7836
## Pos Pred Value : 0.8699
## Neg Pred Value : 0.8323
## Prevalence : 0.6162
## Detection Rate : 0.5556
## Detection Prevalence : 0.6386
## Balanced Accuracy : 0.8426
##
## 'Positive' Class : 0
##
Against the Kaggle test data, the model scored 0.76555.
# now the whole training set
knnTrain = newTrain;
knnTest = newTest;
trainCtrl = trainControl( method="LOOCV" );
knn_fit = train( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=knnTrain,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_train_pred = predict( knn_fit, newdata=knnTrain);
knn_pred = predict( knn_fit, newdata=knnTest);
fullKNNModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(knn_train_pred)-1);
fullKNNModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(knn_pred)-1);
write.csv( fullKNNModelTrain, "KNNFullModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullKNNModel, "KNNFullModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullKNNModelTrain, "KNNFullModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( fullKNNModel, "KNNFullModel.csv", row.names=FALSE, quote=FALSE );
Since some features may not be informative and all features are used in finding closest points, it may be better to consider removing features that just add noise. The idea is to start with all significant features and then consider combinations as subsets for analysis.
significant_features = c("FixedTitle", "Sex", "Siblings", "FixedAge", "CabinAssignment", "pclass", "NumChildren", "FarePerPassenger", "Embarked");
all_combinations_of_significant_features = lapply(1:length(significant_features), function(x) combn(length(significant_features),x));
numFeatureSetClasses = length( all_combinations_of_significant_features );
correct = list();
accuracies = list();
errors = list();
features = list();
numFeatures = list();
k = list();
counter = 1;
numTrain = nrow( newTrain );
trainCtrl = trainControl( method="repeatedcv", number = 8, repeats = 1 );
for ( featureClass in 1:numFeatureSetClasses){
featClassColumns = dim( all_combinations_of_significant_features[[featureClass]] )[2];
for ( thisCol in 1:featClassColumns ){
feature_set = significant_features[ all_combinations_of_significant_features[[featureClass]][,thisCol]];
nFeatures = length( feature_set );
df = newTrain[, c("SurvivedFactor", feature_set)]
knn_fit = train( SurvivedFactor~.,data=df,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_pred = predict( knn_fit, newdata=newTrain);
numRight = sum( knn_pred == newTrain$SurvivedFactor );
cv.error = ( numTrain - numRight )/numTrain;
accuracy = mean( knn_pred == newTrain$SurvivedFactor );
features[counter] = list( feature_set );
numFeatures[counter] = nFeatures;
accuracies[counter] = accuracy;
correct[counter] = numRight;
errors[counter] = cv.error;
k[counter] = knn_fit$bestTune$k;
counter = counter + 1;
}
}
cvResults = data.frame( NumFeatures=unlist(numFeatures), CVError=unlist(errors), TrainingScore=unlist(accuracies), NumCorrect=unlist(correct), K=unlist(k), Features=I(features) );
cvResults = cvResults[order(cvResults$CVError),];
knitr::kable( head(cvResults,n=20L), format="markdown", longtable=TRUE);
| NumFeatures | CVError | TrainingScore | NumCorrect | K | Features | |
|---|---|---|---|---|---|---|
| 504 | 8 | 0.1189675 | 0.8810325 | 785 | 5 | FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger, Embarked |
| 387 | 6 | 0.1234568 | 0.8765432 | 781 | 5 | FixedTitle, Sex, Siblings, FixedAge, pclass, FarePerPassenger |
| 467 | 7 | 0.1234568 | 0.8765432 | 781 | 5 | FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger |
| 472 | 7 | 0.1245791 | 0.8754209 | 780 | 5 | FixedTitle, Sex, Siblings, FixedAge, pclass, NumChildren, FarePerPassenger |
| 502 | 8 | 0.1268238 | 0.8731762 | 778 | 5 | FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, NumChildren, FarePerPassenger |
| 506 | 8 | 0.1268238 | 0.8731762 | 778 | 5 | FixedTitle, Sex, Siblings, FixedAge, pclass, NumChildren, FarePerPassenger, Embarked |
| 444 | 6 | 0.1290685 | 0.8709315 | 776 | 5 | Sex, Siblings, FixedAge, pclass, NumChildren, FarePerPassenger |
| 507 | 8 | 0.1290685 | 0.8709315 | 776 | 5 | FixedTitle, Sex, Siblings, CabinAssignment, pclass, NumChildren, FarePerPassenger, Embarked |
| 331 | 5 | 0.1301908 | 0.8698092 | 775 | 5 | Sex, Siblings, FixedAge, pclass, FarePerPassenger |
| 403 | 6 | 0.1301908 | 0.8698092 | 775 | 5 | FixedTitle, Sex, FixedAge, CabinAssignment, pclass, FarePerPassenger |
| 481 | 7 | 0.1301908 | 0.8698092 | 775 | 5 | FixedTitle, Sex, FixedAge, CabinAssignment, pclass, NumChildren, FarePerPassenger |
| 496 | 7 | 0.1301908 | 0.8698092 | 775 | 5 | Sex, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger, Embarked |
| 418 | 6 | 0.1313131 | 0.8686869 | 774 | 5 | FixedTitle, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger |
| 483 | 7 | 0.1313131 | 0.8686869 | 774 | 5 | FixedTitle, Sex, FixedAge, CabinAssignment, pclass, FarePerPassenger, Embarked |
| 389 | 6 | 0.1324355 | 0.8675645 | 773 | 5 | FixedTitle, Sex, Siblings, FixedAge, NumChildren, FarePerPassenger |
| 391 | 6 | 0.1335578 | 0.8664422 | 772 | 5 | FixedTitle, Sex, Siblings, FixedAge, FarePerPassenger, Embarked |
| 410 | 6 | 0.1335578 | 0.8664422 | 772 | 5 | FixedTitle, Sex, FixedAge, pclass, FarePerPassenger, Embarked |
| 420 | 6 | 0.1335578 | 0.8664422 | 772 | 5 | FixedTitle, Siblings, FixedAge, CabinAssignment, NumChildren, FarePerPassenger |
| 439 | 6 | 0.1335578 | 0.8664422 | 772 | 5 | Sex, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger |
| 441 | 6 | 0.1335578 | 0.8664422 | 772 | 5 | Sex, Siblings, FixedAge, CabinAssignment, NumChildren, FarePerPassenger |
Now, take the top 10 results and repeat the analysis using LOOCV. LOOCV provides a better estimate of the error but is computationally costly.
correct = list();
accuracies = list();
errors = list();
features = list();
numFeatures = list();
k = list();
counter = 1;
numTrain = nrow( newTrain );
trainCtrl = trainControl( method="LOOCV" );
for ( rowNum in 1:10 ){
feature_set = cvResults$Features[[rowNum]];
nFeatures = cvResults$NumFeatures[rowNum];
df = newTrain[, c("SurvivedFactor", feature_set)]
knn_fit = train( SurvivedFactor~.,data=df,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_pred = predict( knn_fit, newdata=newTrain);
numRight = sum( knn_pred == newTrain$SurvivedFactor );
cv.error = ( numTrain - numRight )/numTrain;
accuracy = mean( knn_pred == newTrain$SurvivedFactor );
features[counter] = list( feature_set );
numFeatures[counter] = nFeatures;
accuracies[counter] = accuracy;
correct[counter] = numRight;
errors[counter] = cv.error;
k[counter] = knn_fit$bestTune$k;
counter = counter + 1;
cat( counter, "\n")
}
loocvResults = data.frame( NumFeatures=unlist(numFeatures), CVError=unlist(errors), TrainingScore=unlist(accuracies), NumCorrect=unlist(correct), K=unlist(k), Features=I(features) );
loocvResults = loocvResults[order(loocvResults$CVError),];
knitr::kable( head(loocvResults,n=10L), format="markdown", longtable=TRUE);
Now, take the best 9, 8, 7, and 6 knn models.
With 9 features, the model achieves a training set accuracy of 0.8754 using a cross-validated K of 5. On the test set, the model scored 0.76076.
# best 9
knnTrain = newTrain;
knnTest = newTest;
trainCtrl = trainControl( method="LOOCV" );
knn_fit = train( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumChildren+Siblings,data=knnTrain,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_fit
## k-Nearest Neighbors
##
## 891 samples
## 9 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 890, 890, 890, 890, 890, 890, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8282828 0.6314868
## 7 0.8249158 0.6232068
## 9 0.8002245 0.5681306
## 11 0.8159371 0.6029930
## 13 0.8193042 0.6109118
## 15 0.8013468 0.5736808
## 17 0.7979798 0.5647501
## 19 0.7946128 0.5592293
## 21 0.8013468 0.5722446
## 23 0.7991021 0.5669249
## 25 0.7946128 0.5557490
## 27 0.7923681 0.5493689
## 29 0.7923681 0.5478336
## 31 0.7912458 0.5430542
## 33 0.7957351 0.5518579
## 35 0.8024691 0.5666318
## 37 0.8013468 0.5639201
## 39 0.7979798 0.5562749
## 41 0.8013468 0.5639201
## 43 0.8002245 0.5622081
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot( knn_fit )
# training set performance
knn_train_pred = predict( knn_fit, newdata=knnTrain);
confusionMatrix(knn_train_pred, knnTrain$SurvivedFactor);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 502 65
## 1 47 277
##
## Accuracy : 0.8743
## 95% CI : (0.8507, 0.8954)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7316
## Mcnemar's Test P-Value : 0.1082
##
## Sensitivity : 0.9144
## Specificity : 0.8099
## Pos Pred Value : 0.8854
## Neg Pred Value : 0.8549
## Prevalence : 0.6162
## Detection Rate : 0.5634
## Detection Prevalence : 0.6364
## Balanced Accuracy : 0.8622
##
## 'Positive' Class : 0
##
# test set performance
knn_pred = predict( knn_fit, newdata=knnTest);
best9KNNModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(knn_train_pred)-1);
best9KNNModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(knn_pred)-1);
write.csv( best9KNNModelTrain, "KNNBest9ModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best9KNNModel, "KNNBest9ModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best9KNNModelTrain, "KNNBest9ModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( best9KNNModel, "KNNBest9Model.csv", row.names=FALSE, quote=FALSE );
With 8 features, the model achieves a training set accuracy of 0.881 using a cross-validated K of 5. On the test set, the model scored 0.73684.
# best 8
trainCtrl = trainControl( method="LOOCV" );
knn_fit = train( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+Siblings,data=knnTrain,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_fit
## k-Nearest Neighbors
##
## 891 samples
## 8 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (12), scaled (12)
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 890, 890, 890, 890, 890, 890, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8372615 0.6499720
## 7 0.8249158 0.6227834
## 9 0.8080808 0.5853479
## 11 0.8181818 0.6078345
## 13 0.8170595 0.6078398
## 15 0.8103255 0.5924934
## 17 0.8002245 0.5676432
## 19 0.7968575 0.5601080
## 21 0.8058361 0.5766826
## 23 0.8047138 0.5730220
## 25 0.8035915 0.5683534
## 27 0.7957351 0.5503145
## 29 0.7934905 0.5438017
## 31 0.7867565 0.5256468
## 33 0.7867565 0.5261965
## 35 0.7867565 0.5289256
## 37 0.7957351 0.5518579
## 39 0.8002245 0.5627078
## 41 0.7979798 0.5567826
## 43 0.8013468 0.5644187
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot( knn_fit )
# training set performance
knn_train_pred = predict( knn_fit, newdata=knnTrain);
confusionMatrix(knn_train_pred, knnTrain$SurvivedFactor);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 506 65
## 1 43 277
##
## Accuracy : 0.8788
## 95% CI : (0.8555, 0.8995)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7406
## Mcnemar's Test P-Value : 0.04331
##
## Sensitivity : 0.9217
## Specificity : 0.8099
## Pos Pred Value : 0.8862
## Neg Pred Value : 0.8656
## Prevalence : 0.6162
## Detection Rate : 0.5679
## Detection Prevalence : 0.6409
## Balanced Accuracy : 0.8658
##
## 'Positive' Class : 0
##
# test set performance
knn_pred = predict( knn_fit, newdata=knnTest);
best8KNNModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(knn_train_pred)-1);
best8KNNModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(knn_pred)-1);
write.csv( best8KNNModelTrain, "KNNBest8ModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best8KNNModel, "KNNBest8ModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best8KNNModelTrain, "KNNBest8ModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( best8KNNModel, "KNNBest8Model.csv", row.names=FALSE, quote=FALSE );
With 7 features, the model achieves a training set accuracy of 0.8765 using a cross-validated K of 5. On the test set, the model scored 0.73684.
# best 7
trainCtrl = trainControl( method="LOOCV" );
knn_fit = train( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Siblings,data=knnTrain,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_fit
## k-Nearest Neighbors
##
## 891 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 890, 890, 890, 890, 890, 890, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8327722 0.6431181
## 7 0.8148148 0.6034713
## 9 0.8193042 0.6104744
## 11 0.8170595 0.6069611
## 13 0.8260382 0.6287470
## 15 0.8226712 0.6234473
## 17 0.8058361 0.5860944
## 19 0.8103255 0.5943115
## 21 0.8002245 0.5724683
## 23 0.8013468 0.5760530
## 25 0.7957351 0.5633482
## 27 0.7991021 0.5712627
## 29 0.8013468 0.5760530
## 31 0.7968575 0.5650200
## 33 0.7890011 0.5464249
## 35 0.7968575 0.5625778
## 37 0.7968575 0.5625778
## 39 0.7676768 0.5052987
## 41 0.7789001 0.5323134
## 43 0.7766554 0.5254715
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot( knn_fit )
# training set performance
knn_train_pred = predict( knn_fit, newdata=knnTrain);
confusionMatrix(knn_train_pred, knnTrain$SurvivedFactor);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 499 61
## 1 50 281
##
## Accuracy : 0.8754
## 95% CI : (0.8519, 0.8964)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.735
## Mcnemar's Test P-Value : 0.3425
##
## Sensitivity : 0.9089
## Specificity : 0.8216
## Pos Pred Value : 0.8911
## Neg Pred Value : 0.8489
## Prevalence : 0.6162
## Detection Rate : 0.5600
## Detection Prevalence : 0.6285
## Balanced Accuracy : 0.8653
##
## 'Positive' Class : 0
##
# test set performance
knn_pred = predict( knn_fit, newdata=knnTest);
best7KNNModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(knn_train_pred)-1);
best7KNNModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(knn_pred)-1);
write.csv( best7KNNModelTrain, "KNNBest7ModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best7KNNModel, "KNNBest7ModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best7KNNModelTrain, "KNNBest7ModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( best7KNNModel, "KNNBest7Model.csv", row.names=FALSE, quote=FALSE );
With 6 features, the model achieves a training set accuracy of 0.8765 using a cross-validated K of 5. On the test set, the model scored 0.71770.
# best 6
trainCtrl = trainControl( method="LOOCV" );
knn_fit = train( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+Siblings,data=knnTrain,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_fit
## k-Nearest Neighbors
##
## 891 samples
## 6 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (9), scaled (9)
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 890, 890, 890, 890, 890, 890, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8439955 0.6674400
## 7 0.8204265 0.6182581
## 9 0.8204265 0.6144109
## 11 0.8193042 0.6126516
## 13 0.8271605 0.6321656
## 15 0.8271605 0.6309368
## 17 0.8136925 0.6039427
## 19 0.7968575 0.5650200
## 21 0.7923681 0.5563979
## 23 0.7845118 0.5383270
## 25 0.7845118 0.5372931
## 27 0.7811448 0.5308513
## 29 0.7833895 0.5356631
## 31 0.7878788 0.5468065
## 33 0.7957351 0.5643196
## 35 0.7957351 0.5633482
## 37 0.7901235 0.5485845
## 39 0.7901235 0.5485845
## 41 0.7744108 0.5169559
## 43 0.7732884 0.5196483
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot( knn_fit )
# training set performance
knn_train_pred = predict( knn_fit, newdata=knnTrain);
confusionMatrix(knn_train_pred, knnTrain$SurvivedFactor);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 497 59
## 1 52 283
##
## Accuracy : 0.8754
## 95% CI : (0.8519, 0.8964)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7356
## Mcnemar's Test P-Value : 0.569
##
## Sensitivity : 0.9053
## Specificity : 0.8275
## Pos Pred Value : 0.8939
## Neg Pred Value : 0.8448
## Prevalence : 0.6162
## Detection Rate : 0.5578
## Detection Prevalence : 0.6240
## Balanced Accuracy : 0.8664
##
## 'Positive' Class : 0
##
# test set performance
knn_pred = predict( knn_fit, newdata=knnTest);
best6KNNModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(knn_train_pred)-1);
best6KNNModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(knn_pred)-1);
write.csv( best6KNNModelTrain, "KNNBest6ModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best6KNNModel, "KNNBest6ModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best6KNNModelTrain, "KNNBest6ModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( best6KNNModel, "KNNBest6Model.csv", row.names=FALSE, quote=FALSE );
The test set score of 0.71770 is less than that of the simple Gender Submission Model (0.76555).
A classification tree is used to model the data. Since the features contain nominal data (Embarked, Sex, for example), a decision tree is appropriate. After the tree is fit, terminal nodes are pruned using cross validation.
The tree model scored 0.8507 against the training data and 0.7727 against the test data.
library(tree);
tree_model = tree( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain );
summary( tree_model )
##
## Classification tree:
## tree(formula = SurvivedFactor ~ pclass + Sex + FixedAge + FarePerPassenger +
## FixedTitle + CabinAssignment + Embarked + NumPassengersOnTicket +
## NumParents + NumChildren + SpousesFactor + Siblings, data = newTrain)
## Variables actually used in tree construction:
## [1] "FixedTitle" "FarePerPassenger" "pclass"
## [4] "NumPassengersOnTicket" "Sex"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7585 = 669 / 882
## Misclassification error rate: 0.1493 = 133 / 891
cv_tree_model = cv.tree( tree_model, FUN=prune.misclass);
best_terminal_node_idx = which.min(cv_tree_model$dev);
best_terminal_nodes = cv_tree_model$size[best_terminal_node_idx];
pruned_tree = prune.misclass(tree_model,best=best_terminal_nodes);
plot( pruned_tree );
text( pruned_tree, pretty=0);
# training set performance against pruned tree
pruned_tree_train_pred = predict( pruned_tree, newTrain, type="class");
confusionMatrix(pruned_tree_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 507 91
## 1 42 251
##
## Accuracy : 0.8507
## 95% CI : (0.8256, 0.8735)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6757
## Mcnemar's Test P-Value : 3.153e-05
##
## Sensitivity : 0.9235
## Specificity : 0.7339
## Pos Pred Value : 0.8478
## Neg Pred Value : 0.8567
## Prevalence : 0.6162
## Detection Rate : 0.5690
## Detection Prevalence : 0.6712
## Balanced Accuracy : 0.8287
##
## 'Positive' Class : 0
##
# test set performance
tree_pred = predict( pruned_tree, newTest, type="class");
treeModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(pruned_tree_train_pred)-1);
treeModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(tree_pred)-1);
write.csv( treeModelTrain, "TreeModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( treeModel, "TreeModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( treeModelTrain, "TreeModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( treeModel, "TreeModel.csv", row.names=FALSE, quote=FALSE );
Next, a tree model using only significant features is considered. Again, cross-validation is used to prune the tree. The model achieved an accuracy of 0.8541 on the training data and scored 0.7608 against the test data.
library(tree);
tree_model = tree( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumChildren+Siblings,data=newTrain );
summary( tree_model )
##
## Classification tree:
## tree(formula = SurvivedFactor ~ pclass + Sex + FixedAge + FarePerPassenger +
## FixedTitle + CabinAssignment + Embarked + NumChildren + Siblings,
## data = newTrain)
## Variables actually used in tree construction:
## [1] "FixedTitle" "FarePerPassenger" "pclass"
## [4] "Siblings" "FixedAge" "Sex"
## Number of terminal nodes: 10
## Residual mean deviance: 0.7536 = 663.9 / 881
## Misclassification error rate: 0.1459 = 130 / 891
cv_tree_model = cv.tree( tree_model, FUN=prune.misclass);
best_terminal_node_idx = which.min(cv_tree_model$dev);
best_terminal_nodes = cv_tree_model$size[best_terminal_node_idx];
pruned_tree = prune.misclass(tree_model,best=best_terminal_nodes);
plot( pruned_tree );
text( pruned_tree, pretty=0);
# training set performance against pruned tree
pruned_tree_train_pred = predict( pruned_tree, newTrain, type="class");
confusionMatrix(pruned_tree_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 510 91
## 1 39 251
##
## Accuracy : 0.8541
## 95% CI : (0.8292, 0.8766)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6824
## Mcnemar's Test P-Value : 7.713e-06
##
## Sensitivity : 0.9290
## Specificity : 0.7339
## Pos Pred Value : 0.8486
## Neg Pred Value : 0.8655
## Prevalence : 0.6162
## Detection Rate : 0.5724
## Detection Prevalence : 0.6745
## Balanced Accuracy : 0.8314
##
## 'Positive' Class : 0
##
# test set performance
tree_pred = predict( pruned_tree, newTest, type="class");
# scored 0.7608
treeModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(pruned_tree_train_pred)-1);
treeModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(tree_pred)-1);
write.csv( treeModelTrain, "TreeSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( treeModel, "TreeSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( treeModelTrain, "TreeSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( treeModel, "TreeSigModel.csv", row.names=FALSE, quote=FALSE );
Bootstrap Aggregation is a method of resampling the data many times to generate a series of samples from the original training set. This allows for the fitting of many trees. In Bagging tree models, the results of these trees are accumulated and the class that a particular sample assumes most often is the winner (by majority vote). The averaging of many trees reduces variance.
The Bagging model scored an impressive 0.9877 on the training set but only 0.7321 against the test set. The high train set accuracy and relatively low test set accuracy is indicative of over-training.
library(randomForest);
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
set.seed(1);
bag_model = randomForest( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain, mtry=12, importance=TRUE );
bag_model
##
## Call:
## randomForest(formula = SurvivedFactor ~ pclass + Sex + FixedAge + FarePerPassenger + FixedTitle + CabinAssignment + Embarked + NumPassengersOnTicket + NumParents + NumChildren + SpousesFactor + Siblings, data = newTrain, mtry = 12, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 17.96%
## Confusion matrix:
## 0 1 class.error
## 0 479 70 0.1275046
## 1 90 252 0.2631579
plot( bag_model )
# Get training set performance
bag_train_pred = predict( bag_model, newdata=newTrain);
confusionMatrix( bag_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 546 8
## 1 3 334
##
## Accuracy : 0.9877
## 95% CI : (0.978, 0.9938)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9738
## Mcnemar's Test P-Value : 0.2278
##
## Sensitivity : 0.9945
## Specificity : 0.9766
## Pos Pred Value : 0.9856
## Neg Pred Value : 0.9911
## Prevalence : 0.6162
## Detection Rate : 0.6128
## Detection Prevalence : 0.6218
## Balanced Accuracy : 0.9856
##
## 'Positive' Class : 0
##
# now test set
bag_pred = predict( bag_model, newdata=newTest);
bagModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(bag_train_pred)-1);
bagModel = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(bag_pred)-1);
write.csv( bagModelTrain, "BagModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bagModel, "BagModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bagModelTrain, "BagModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( bagModel, "BagModel.csv", row.names=FALSE, quote=FALSE );
# scored 0.73205
Plot shows the OOB error (black) as well as the class errors (0/Died = red, 1/Survived = green).
Like the full featured model, the significant feature bagging model scored an impressive 0.9854 on the training set but only 0.7273 against the test set. The high train set accuracy and relatively low test set accuracy is indicative of over-training.
library(randomForest);
set.seed(1);
bag_model = randomForest( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumChildren+Siblings,data=newTrain, mtry=9, importance=TRUE );
bag_model
##
## Call:
## randomForest(formula = SurvivedFactor ~ pclass + Sex + FixedAge + FarePerPassenger + FixedTitle + CabinAssignment + Embarked + NumChildren + Siblings, data = newTrain, mtry = 9, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 9
##
## OOB estimate of error rate: 18.07%
## Confusion matrix:
## 0 1 class.error
## 0 481 68 0.1238616
## 1 93 249 0.2719298
plot( bag_model )
# Get training set performance
bag_train_pred = predict( bag_model, newdata=newTrain);
confusionMatrix( bag_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 544 8
## 1 5 334
##
## Accuracy : 0.9854
## 95% CI : (0.9752, 0.9922)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9691
## Mcnemar's Test P-Value : 0.5791
##
## Sensitivity : 0.9909
## Specificity : 0.9766
## Pos Pred Value : 0.9855
## Neg Pred Value : 0.9853
## Prevalence : 0.6162
## Detection Rate : 0.6105
## Detection Prevalence : 0.6195
## Balanced Accuracy : 0.9838
##
## 'Positive' Class : 0
##
# now test set
bag_pred = predict( bag_model, newdata=newTest);
bagModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(bag_train_pred)-1);
bagModel = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(bag_pred)-1);
write.csv( bagModelTrain, "BagSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bagModel, "BagSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bagModelTrain, "BagSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( bagModel, "BagSigModel.csv", row.names=FALSE, quote=FALSE );
# scored 0.7273
Plot shows the OOB error (black) as well as the class errors (0/Died = red, 1/Survived = green).
A random forest is much like the bagging idea but has the constraint that each branch point is allowed a random selection of m features to use. This random selection of predictors decorrelates the trees from one another and creates a much more diverse set of trees than does bagging. As a rule of thumb, the number of predictors to choose from is set to the square root of the number of available predictors (for classification trees).
Using the standard sqrt(predictors) = 3 for mtry resulted in another overfit model. But reducing the mtry to 1 (the number of predictors considered at each branch point), the random forest model scored 0.8575 against the training data and 0.7847 against the test data.
library(randomForest);
set.seed(1);
rf_model = randomForest( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain, mtry=1, importance=TRUE );
rf_model
##
## Call:
## randomForest(formula = SurvivedFactor ~ pclass + Sex + FixedAge + FarePerPassenger + FixedTitle + CabinAssignment + Embarked + NumPassengersOnTicket + NumParents + NumChildren + SpousesFactor + Siblings, data = newTrain, mtry = 1, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 17.62%
## Confusion matrix:
## 0 1 class.error
## 0 506 43 0.07832423
## 1 114 228 0.33333333
importance( rf_model );
## 0 1 MeanDecreaseAccuracy
## pclass 7.731820 13.2196507 14.943597
## Sex 15.218208 21.1711598 19.614457
## FixedAge 9.195339 9.0631091 13.216547
## FarePerPassenger 8.601746 11.9774619 15.137825
## FixedTitle 14.129828 18.4765974 18.439136
## CabinAssignment 10.028316 8.7620830 13.199128
## Embarked 4.265670 7.0988848 8.023670
## NumPassengersOnTicket 12.110813 8.7586309 15.309226
## NumParents 10.283945 5.0274583 12.075322
## NumChildren 5.132610 3.2312246 6.550804
## SpousesFactor 4.192165 0.6604533 4.439363
## Siblings 14.902568 2.2274841 14.908893
## MeanDecreaseGini
## pclass 13.317381
## Sex 35.371304
## FixedAge 12.770873
## FarePerPassenger 18.733637
## FixedTitle 36.625948
## CabinAssignment 10.341210
## Embarked 5.687204
## NumPassengersOnTicket 12.360933
## NumParents 3.886940
## NumChildren 3.043912
## SpousesFactor 2.148494
## Siblings 6.499895
varImpPlot( rf_model );
plot( rf_model )
# training set performance
rf_train_pred = predict( rf_model, newdata=newTrain);
confusionMatrix( rf_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 524 102
## 1 25 240
##
## Accuracy : 0.8575
## 95% CI : (0.8328, 0.8798)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6853
## Mcnemar's Test P-Value : 1.542e-11
##
## Sensitivity : 0.9545
## Specificity : 0.7018
## Pos Pred Value : 0.8371
## Neg Pred Value : 0.9057
## Prevalence : 0.6162
## Detection Rate : 0.5881
## Detection Prevalence : 0.7026
## Balanced Accuracy : 0.8281
##
## 'Positive' Class : 0
##
# test set performance
rf_pred = predict( rf_model, newdata=newTest);
rfModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(rf_train_pred)-1);
rfModel = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(rf_pred)-1);
write.csv( rfModelTrain, "RandomForestModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModel, "RandomForestModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModelTrain, "RandomForestModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModel, "RandomForestModel.csv", row.names=FALSE, quote=FALSE );
# scored 0.7847
Using the standard sqrt(predictors) = 3 for mtry resulted in another overfit model. But reducing the mtry to 1 (the number of predictors considered at each branch point), the random forest model scored 0.8608 against the training data and 0.7823 against the test data.
library(randomForest);
set.seed(1);
rf_model = randomForest( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumChildren+Siblings,data=newTrain, importance=TRUE, mtry = 1 );
rf_model
##
## Call:
## randomForest(formula = SurvivedFactor ~ pclass + Sex + FixedAge + FarePerPassenger + FixedTitle + CabinAssignment + Embarked + NumChildren + Siblings, data = newTrain, importance = TRUE, mtry = 1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 17.51%
## Confusion matrix:
## 0 1 class.error
## 0 502 47 0.0856102
## 1 109 233 0.3187135
importance( rf_model );
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## pclass 9.333196 14.819291 17.015375 16.386503
## Sex 16.023715 24.064663 22.274436 43.866722
## FixedAge 12.755309 10.706769 16.336319 15.238069
## FarePerPassenger 8.468047 11.854202 16.431479 21.697107
## FixedTitle 15.963874 20.103090 20.571212 48.303153
## CabinAssignment 10.567478 7.894701 13.556124 12.139451
## Embarked 4.083765 9.727112 9.901060 5.880477
## NumChildren 6.777913 2.930898 7.832526 3.771433
## Siblings 16.768459 3.949508 16.413947 7.036073
varImpPlot( rf_model );
plot( rf_model )
# training set performance
rf_train_pred = predict( rf_model, newdata=newTrain);
confusionMatrix( rf_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 516 91
## 1 33 251
##
## Accuracy : 0.8608
## 95% CI : (0.8363, 0.8829)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6961
## Mcnemar's Test P-Value : 3.076e-07
##
## Sensitivity : 0.9399
## Specificity : 0.7339
## Pos Pred Value : 0.8501
## Neg Pred Value : 0.8838
## Prevalence : 0.6162
## Detection Rate : 0.5791
## Detection Prevalence : 0.6813
## Balanced Accuracy : 0.8369
##
## 'Positive' Class : 0
##
# test set performance
rf_pred = predict( rf_model, newdata=newTest);
rfModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(rf_train_pred)-1);
rfModel = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(rf_pred)-1);
write.csv( rfModelTrain, "RandomForestSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModel, "RandomForestSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModelTrain, "RandomForestSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModel, "RandomForestSigModel.csv", row.names=FALSE, quote=FALSE );
# scored 0.7823
The Random forest model using all features generated a variable importance plot. The variable set that looked to be most informative is selected and used in this model. This model scored 0.8485 on the training set and 0.7775 on the test set.
library(randomForest);
set.seed(1);
rf_model = randomForest( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+NumParents+Siblings+NumPassengersOnTicket,data=newTrain, importance=TRUE, mtry=1 );
rf_model
##
## Call:
## randomForest(formula = SurvivedFactor ~ pclass + Sex + FixedAge + FarePerPassenger + FixedTitle + CabinAssignment + NumParents + Siblings + NumPassengersOnTicket, data = newTrain, importance = TRUE, mtry = 1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 16.5%
## Confusion matrix:
## 0 1 class.error
## 0 493 56 0.1020036
## 1 91 251 0.2660819
importance( rf_model );
## 0 1 MeanDecreaseAccuracy
## pclass 8.319264 11.920284 15.35410
## Sex 16.526483 23.762523 22.64204
## FixedAge 10.121820 9.609194 15.55474
## FarePerPassenger 8.057060 14.282136 17.34949
## FixedTitle 16.500111 22.101773 22.35903
## CabinAssignment 8.616449 6.996768 12.01374
## NumParents 10.059492 4.893490 11.73343
## Siblings 15.261830 4.924502 16.26732
## NumPassengersOnTicket 12.293240 10.961389 16.89836
## MeanDecreaseGini
## pclass 15.668979
## Sex 40.127918
## FixedAge 14.342667
## FarePerPassenger 23.537830
## FixedTitle 48.522611
## CabinAssignment 11.741865
## NumParents 4.153769
## Siblings 7.916815
## NumPassengersOnTicket 15.004035
varImpPlot( rf_model );
plot( rf_model )
# training set performance
rf_train_pred = predict( rf_model, newdata=newTrain);
confusionMatrix( rf_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 499 85
## 1 50 257
##
## Accuracy : 0.8485
## 95% CI : (0.8232, 0.8714)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6734
## Mcnemar's Test P-Value : 0.003431
##
## Sensitivity : 0.9089
## Specificity : 0.7515
## Pos Pred Value : 0.8545
## Neg Pred Value : 0.8371
## Prevalence : 0.6162
## Detection Rate : 0.5600
## Detection Prevalence : 0.6554
## Balanced Accuracy : 0.8302
##
## 'Positive' Class : 0
##
# test set performance
rf_pred = predict( rf_model, newdata=newTest);
rfModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(rf_train_pred)-1);
rfModel = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(rf_pred)-1);
write.csv( rfModelTrain, "RandomForestTrimmedModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModel, "RandomForestTrimmedModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModelTrain, "RandomForestTrimmedModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModel, "RandomForestTrimmedModel.csv", row.names=FALSE, quote=FALSE );
# scored 0.7775
Gradient Boosting applied to classification trees involves fitting classification trees sequentially. That is, instead of generating a bunch of trees and averaging them as with bagging and random forests, boosting involves adding trees sequentially into a single model. Trees are successively fit to the residuals of the previous tree ensemble.
Since there are many parameters to consider for boosted classification trees, cross-validation is used to select the values which will generalize best. This analysis is done using all predictors.
# This is not run; ignore
library(gbm, quietly = TRUE);
tree_lengths = 1:4;
best_error = Inf;
best_idx = 0;
min_error_idx = tree_lengths;
min_errors = tree_lengths;
xTrain = newTrain[,c("Survived","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
xTest = newTest[,c("Survived","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
numSamples = 10;
trainingProportion = 0.8;
threshold = 0.5;
cvFolds = 8; # n-fold validation
nTrees = 15000;
treeDiv = 1000;
numIntTrees = nTrees/treeDiv - 1;
nCores = 7; # adjust according to computer
trainMat = matrix( nrow=numSamples, ncol=numIntTrees);
testMat = matrix( nrow=numSamples, ncol=numIntTrees);
for ( depth in tree_lengths ){
set.seed(depth);
sampleSet = createDataPartition(1:nrow(newTrain),numSamples,p=trainingProportion);
for ( samp in 1:numSamples){
trainSample = sampleSet[[samp]];
testSample = setdiff(1:nrow(newTrain),trainSample);
boosted_model = gbm( formula=Survived~.,data=xTrain[trainSample,], distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = depth, n.cores = nCores, verbose=FALSE );
best_tree_num = gbm.perf( boosted_model );
train_scores = 0;
test_scores = 0;
for ( ntree in 1:numIntTrees){
treelen = (ntree-1)*treeDiv;
# now that the model is trained, see how it does on training set for n trees.
train_pred = predict( boosted_model, newdata=xTrain[trainSample,], n.trees=treelen, type="response");
train_predictions = as.numeric( train_pred >= threshold );
train_score = sum( train_predictions == newTrain$Survived[trainSample] )/length(trainSample);
train_scores[ ntree ] = train_score;
trainMat[samp,ntree] = train_score;
cat( "trainMat[", samp, ",", ntree, "] = ", train_score, "\n");
# now that the model is trained, see how it does on test set for n trees.
test_pred = predict( boosted_model, newdata=xTrain[testSample,], n.trees=treelen, type="response");
test_predictions = as.numeric( test_pred >= threshold );
test_score = sum( test_predictions == newTrain$Survived[testSample] )/length(testSample);
test_scores[ ntree ] = test_score;
testMat[samp,ntree] = test_score;
}
}
boosted_model = gbm( formula=Survived~.,data=xTrain, distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = depth, n.cores = nCores, verbose=FALSE );
min_error_idx[depth] = which.min( boosted_model$cv.error );
min_errors[depth] = min( boosted_model$cv.error );
if ( min_errors[depth] < best_error ){
best_error = min_errors[depth];
best_idx = depth;
}
}
gbtrees = data.frame( TreeDepth = tree_lengths, BernoulliDeviance = min_errors, NumTrees = min_error_idx );
knitr::kable( gbtrees, format="markdown", longtable=TRUE);
#summary( boosted_model );
#plot( boosted_model );
set.seed(best_idx);
best_model = gbm( formula=Survived~.,data=xTrain, distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = best_idx, n.cores = nCores, verbose=FALSE );
gbm.perf( best_model );
boosted_train_pred = predict( best_model, newdata=xTrain, n.trees=min_error_idx[best_idx], type="response");
boosted_pred = predict( best_model, newdata=xTest, n.trees=min_error_idx[best_idx], type="response");
# probabilities >= 0.5 mean survived, < 0.5 mean perished
boosted_train_predictions = as.numeric( boosted_train_pred >= threshold );
boosted_predictions = as.numeric( boosted_pred >= threshold );
boostedModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=boosted_train_predictions);
boostedModel = data.frame(PassengerId=newTest$PassengerId, Survived=boosted_predictions);
boostedModelTrainProbs = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(boosted_train_pred));
boostedModelProbs = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(boosted_pred));
write.csv( boostedModelTrainProbs, "BoostedTreeModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModelProbs, "BoostedTreeModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModelTrain, "BoostedTreeModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModel, "BoostedTreeModel.csv", row.names=FALSE, quote=FALSE );
# originally scored 0.76076 with interaction depth =4, 0.77990 with interaction depth = 1
# after 10-fold x-validation, considering interaction depths of 1 to 4, scored a 0.76555
# same model , stopping at 4800 trees yields 0.77511
library(gbm, quietly = TRUE);
##
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
##
## aml
## The following object is masked from 'package:caret':
##
## cluster
## Loaded gbm 2.1.3
tree_lengths = 1:10;
numLengths = length( tree_lengths );
best_error = Inf;
best_idx = 0;
best_shrinkage = 0;
best_steps = 0;
xTrain = newTrain[,c("Survived","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
xTest = newTest[,c("Survived","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
learningRates = c(0.1,0.07,0.05,0.03,0.02,0.01);
numLR = length( learningRates );
threshold = 0.5;
cvFolds = 10; # n-fold validation
nTrees = 3000;
nCores = 7; # adjust according to computer
tree_depths = rep(0,numLengths*numLR);
steps = rep(0,numLengths*numLR);
deviances = rep(0,numLengths*numLR);
shrinkages = rep(0,numLengths*numLR);
idx = 1;
for ( depth in tree_lengths ){
set.seed(depth);
for ( lrIdx in 1:numLR){
learningRate = learningRates[lrIdx];
boosted_model = gbm( formula=Survived~.,data=xTrain, distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = depth, shrinkage = learningRate, n.cores = nCores, verbose=FALSE );
idxCVErr = gbm.perf( boosted_model, plot.it=FALSE );
cvError = boosted_model$cv.error[idxCVErr];
if ( cvError < best_error ){
best_error = cvError;
best_idx = depth;
best_shrinkage = learningRate;
best_steps = idxCVErr;
}
tree_depths[idx] = depth;
shrinkages[idx] = learningRate;
steps[idx] = idxCVErr;
deviances[idx] = cvError;
idx = idx + 1;
}
}
gbtrees = data.frame( TreeDepth = tree_depths, Shrinkage = shrinkages, NumTrees = steps, BernoulliDeviance = deviances );
knitr::kable( gbtrees, format="markdown", longtable=TRUE);
| TreeDepth | Shrinkage | NumTrees | BernoulliDeviance |
|---|---|---|---|
| 1 | 0.10 | 122 | 0.8408372 |
| 1 | 0.07 | 372 | 0.8402510 |
| 1 | 0.05 | 761 | 0.8413770 |
| 1 | 0.03 | 2239 | 0.8308918 |
| 1 | 0.02 | 2813 | 0.8343378 |
| 1 | 0.01 | 2943 | 0.8302995 |
| 2 | 0.10 | 124 | 0.8040676 |
| 2 | 0.07 | 464 | 0.7921007 |
| 2 | 0.05 | 483 | 0.8071318 |
| 2 | 0.03 | 448 | 0.8084767 |
| 2 | 0.02 | 1324 | 0.7910884 |
| 2 | 0.01 | 958 | 0.8213471 |
| 3 | 0.10 | 109 | 0.8066694 |
| 3 | 0.07 | 181 | 0.8009394 |
| 3 | 0.05 | 333 | 0.7897554 |
| 3 | 0.03 | 417 | 0.7885938 |
| 3 | 0.02 | 519 | 0.8010698 |
| 3 | 0.01 | 1471 | 0.7991779 |
| 4 | 0.10 | 139 | 0.7834033 |
| 4 | 0.07 | 108 | 0.7969865 |
| 4 | 0.05 | 127 | 0.8078885 |
| 4 | 0.03 | 355 | 0.7872869 |
| 4 | 0.02 | 543 | 0.7930899 |
| 4 | 0.01 | 911 | 0.7926584 |
| 5 | 0.10 | 68 | 0.8016500 |
| 5 | 0.07 | 167 | 0.7922652 |
| 5 | 0.05 | 287 | 0.7891908 |
| 5 | 0.03 | 186 | 0.8009506 |
| 5 | 0.02 | 499 | 0.7874025 |
| 5 | 0.01 | 568 | 0.8113521 |
| 6 | 0.10 | 46 | 0.7915514 |
| 6 | 0.07 | 67 | 0.8015155 |
| 6 | 0.05 | 85 | 0.8021446 |
| 6 | 0.03 | 174 | 0.8002787 |
| 6 | 0.02 | 320 | 0.7954187 |
| 6 | 0.01 | 478 | 0.8045433 |
| 7 | 0.10 | 40 | 0.7902456 |
| 7 | 0.07 | 101 | 0.7838852 |
| 7 | 0.05 | 131 | 0.7943580 |
| 7 | 0.03 | 154 | 0.7881931 |
| 7 | 0.02 | 273 | 0.7987344 |
| 7 | 0.01 | 519 | 0.7876161 |
| 8 | 0.10 | 64 | 0.8074978 |
| 8 | 0.07 | 55 | 0.7850531 |
| 8 | 0.05 | 118 | 0.7819934 |
| 8 | 0.03 | 265 | 0.7907267 |
| 8 | 0.02 | 432 | 0.7838358 |
| 8 | 0.01 | 479 | 0.7964976 |
| 9 | 0.10 | 31 | 0.7919973 |
| 9 | 0.07 | 65 | 0.8089280 |
| 9 | 0.05 | 113 | 0.7894301 |
| 9 | 0.03 | 132 | 0.7841065 |
| 9 | 0.02 | 241 | 0.7879951 |
| 9 | 0.01 | 494 | 0.7839586 |
| 10 | 0.10 | 40 | 0.7854725 |
| 10 | 0.07 | 56 | 0.7856801 |
| 10 | 0.05 | 108 | 0.7876004 |
| 10 | 0.03 | 128 | 0.8020566 |
| 10 | 0.02 | 253 | 0.7824585 |
| 10 | 0.01 | 358 | 0.7931526 |
#summary( boosted_model );
#plot( boosted_model );
# create a line chart
xrange = range( shrinkages );
yrange = range( deviances );
plot( xrange, yrange, type="n", xlab="Shrinkage", ylab="CV Accuracy (Bernoulli Deviance)");
colors = rainbow( numLengths );
linetype = tree_lengths;
plotchar = seq(0,numLengths,1);
# add lines
for ( i in tree_lengths){
thisDepthTrees = subset( gbtrees, gbtrees$TreeDepth == i );
lines(thisDepthTrees$Shrinkage, thisDepthTrees$BernoulliDeviance, type="b", lwd=1.5, lty=linetype[i], col=colors[i], pch=plotchar[i]);
}
title( "Bernoulli Deviance by Shrinkage and Tree Depth");
legend( 0.08, yrange[2], 1:numLengths, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Tree Depth");
set.seed(best_idx);
best_model = gbm( formula=Survived~.,data=xTrain, distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = best_idx, shrinkage = best_shrinkage, n.cores = nCores, verbose=FALSE );
gbm.perf( best_model );
## [1] 92
boosted_train_pred = predict( best_model, newdata=xTrain, n.trees=best_steps, type="response");
boosted_train_calls = as.numeric( boosted_train_pred >= threshold );
confusionMatrix(boosted_train_calls,newTrain$Survived);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 513 59
## 1 36 283
##
## Accuracy : 0.8934
## 95% CI : (0.8712, 0.9129)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7717
## Mcnemar's Test P-Value : 0.024
##
## Sensitivity : 0.9344
## Specificity : 0.8275
## Pos Pred Value : 0.8969
## Neg Pred Value : 0.8871
## Prevalence : 0.6162
## Detection Rate : 0.5758
## Detection Prevalence : 0.6420
## Balanced Accuracy : 0.8810
##
## 'Positive' Class : 0
##
boosted_pred = predict( best_model, newdata=xTest, n.trees=best_steps, type="response");
# probabilities >= 0.5 mean survived, < 0.5 mean perished
boosted_predictions = as.numeric( boosted_pred >= threshold );
boostedModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=boosted_train_calls);
boostedModel = data.frame(PassengerId=newTest$PassengerId, Survived=boosted_predictions);
boostedModelTrainProbs = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(boosted_train_pred));
boostedModelProbs = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(boosted_pred));
write.csv( boostedModelTrainProbs, "BoostedTreeModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModelProbs, "BoostedTreeModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModelTrain, "BoostedTreeModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModel, "BoostedTreeModel.csv", row.names=FALSE, quote=FALSE );
# interaction.depth = 8, nTrees = 118, learningRate = 0.05
# scored 0.756
The Cross-validated parameters (interaction depth = 8, number of trees = 118, learning rate = 0.05) resulted in a training set accuracy of 0.8943 on the training set and 0.756 on the test set.
The same technique is used as above to find the best set of parameters for gradient boosting model for the significant features.
library(gbm);
tree_lengths = 1:10;
numLengths = length( tree_lengths );
best_error = Inf;
best_idx = 0;
best_shrinkage = 0;
best_steps = 0;
significant_features = c("FixedTitle", "Sex", "Siblings", "FixedAge", "CabinAssignment", "pclass", "NumChildren", "FarePerPassenger", "Embarked");
xTrain = newTrain[,c("Survived", significant_features) ];
xTest = newTest[,c("Survived", significant_features) ];
learningRates = c(0.1,0.07,0.05,0.03,0.02,0.01);
numLR = length( learningRates );
threshold = 0.5;
cvFolds = 10; # n-fold validation
nTrees = 3000;
nCores = 7; # adjust according to computer
tree_depths = rep(0,numLengths*numLR);
steps = rep(0,numLengths*numLR);
deviances = rep(0,numLengths*numLR);
shrinkages = rep(0,numLengths*numLR);
idx = 1;
for ( depth in tree_lengths ){
set.seed(depth);
for ( lrIdx in 1:numLR){
learningRate = learningRates[lrIdx];
boosted_model = gbm( formula=Survived~.,data=xTrain, distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = depth, shrinkage = learningRate, n.cores = nCores, verbose=FALSE );
idxCVErr = gbm.perf( boosted_model, plot.it=FALSE );
cvError = boosted_model$cv.error[idxCVErr];
if ( cvError < best_error ){
best_error = cvError;
best_idx = depth;
best_shrinkage = learningRate;
best_steps = idxCVErr;
}
tree_depths[idx] = depth;
shrinkages[idx] = learningRate;
steps[idx] = idxCVErr;
deviances[idx] = cvError;
idx = idx + 1;
}
}
gbtrees = data.frame( TreeDepth = tree_depths, Shrinkage = shrinkages, NumTrees = steps, BernoulliDeviance = deviances );
knitr::kable( gbtrees, format="markdown", longtable=TRUE);
| TreeDepth | Shrinkage | NumTrees | BernoulliDeviance |
|---|---|---|---|
| 1 | 0.10 | 153 | 0.8408302 |
| 1 | 0.07 | 571 | 0.8382404 |
| 1 | 0.05 | 304 | 0.8458554 |
| 1 | 0.03 | 995 | 0.8296206 |
| 1 | 0.02 | 629 | 0.8401464 |
| 1 | 0.01 | 1897 | 0.8301282 |
| 2 | 0.10 | 126 | 0.8073095 |
| 2 | 0.07 | 422 | 0.7910109 |
| 2 | 0.05 | 334 | 0.8082076 |
| 2 | 0.03 | 449 | 0.8146400 |
| 2 | 0.02 | 1265 | 0.7942554 |
| 2 | 0.01 | 972 | 0.8232836 |
| 3 | 0.10 | 100 | 0.7984287 |
| 3 | 0.07 | 149 | 0.7976263 |
| 3 | 0.05 | 238 | 0.7966997 |
| 3 | 0.03 | 317 | 0.7924063 |
| 3 | 0.02 | 508 | 0.8024181 |
| 3 | 0.01 | 1474 | 0.8006754 |
| 4 | 0.10 | 138 | 0.7866196 |
| 4 | 0.07 | 94 | 0.8003936 |
| 4 | 0.05 | 144 | 0.8054965 |
| 4 | 0.03 | 350 | 0.7944821 |
| 4 | 0.02 | 760 | 0.7988662 |
| 4 | 0.01 | 911 | 0.8000948 |
| 5 | 0.10 | 112 | 0.8077974 |
| 5 | 0.07 | 168 | 0.7915250 |
| 5 | 0.05 | 182 | 0.7860657 |
| 5 | 0.03 | 252 | 0.8017584 |
| 5 | 0.02 | 346 | 0.7924162 |
| 5 | 0.01 | 559 | 0.8152680 |
| 6 | 0.10 | 55 | 0.7973603 |
| 6 | 0.07 | 97 | 0.8045107 |
| 6 | 0.05 | 85 | 0.8081989 |
| 6 | 0.03 | 186 | 0.7906270 |
| 6 | 0.02 | 282 | 0.7930570 |
| 6 | 0.01 | 426 | 0.8056687 |
| 7 | 0.10 | 40 | 0.7941372 |
| 7 | 0.07 | 86 | 0.7867753 |
| 7 | 0.05 | 113 | 0.7992422 |
| 7 | 0.03 | 169 | 0.7881527 |
| 7 | 0.02 | 267 | 0.7961951 |
| 7 | 0.01 | 522 | 0.7967638 |
| 8 | 0.10 | 51 | 0.8050424 |
| 8 | 0.07 | 97 | 0.7838809 |
| 8 | 0.05 | 88 | 0.7897468 |
| 8 | 0.03 | 148 | 0.7863680 |
| 8 | 0.02 | 328 | 0.7818409 |
| 8 | 0.01 | 473 | 0.7996210 |
| 9 | 0.10 | 31 | 0.7961626 |
| 9 | 0.07 | 76 | 0.8101975 |
| 9 | 0.05 | 112 | 0.7938873 |
| 9 | 0.03 | 184 | 0.7894002 |
| 9 | 0.02 | 202 | 0.7953271 |
| 9 | 0.01 | 484 | 0.7881808 |
| 10 | 0.10 | 41 | 0.7866709 |
| 10 | 0.07 | 50 | 0.7920365 |
| 10 | 0.05 | 108 | 0.7915468 |
| 10 | 0.03 | 118 | 0.8051032 |
| 10 | 0.02 | 296 | 0.7828338 |
| 10 | 0.01 | 359 | 0.7974175 |
#summary( boosted_model );
#plot( boosted_model );
# create a line chart
xrange = range( shrinkages );
yrange = range( deviances );
plot( xrange, yrange, type="n", xlab="Shrinkage", ylab="Cross Validation Error (Bernoulli Deviance)");
colors = rainbow( numLengths );
linetype = tree_lengths;
plotchar = seq(0,numLengths,1);
# add lines
for ( i in tree_lengths){
thisDepthTrees = subset( gbtrees, gbtrees$TreeDepth == i );
lines(thisDepthTrees$Shrinkage, thisDepthTrees$BernoulliDeviance, type="b", lwd=1.5, lty=linetype[i], col=colors[i], pch=plotchar[i]);
}
title( "Bernoulli Deviance by Shrinkage and Tree Depth");
legend( 0.08, yrange[2], 1:numLengths, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Tree Depth");
set.seed(best_idx);
best_model = gbm( formula=Survived~.,data=xTrain, distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = best_idx, shrinkage = best_shrinkage, n.cores = nCores, verbose=FALSE );
gbm.perf( best_model );
## [1] 306
boosted_train_pred = predict( best_model, newdata=xTrain, n.trees=best_steps, type="response");
boosted_train_calls = as.numeric( boosted_train_pred >= threshold );
confusionMatrix(boosted_train_calls,newTrain$Survived);
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 517 64
## 1 32 278
##
## Accuracy : 0.8923
## 95% CI : (0.87, 0.9119)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7681
## Mcnemar's Test P-Value : 0.001557
##
## Sensitivity : 0.9417
## Specificity : 0.8129
## Pos Pred Value : 0.8898
## Neg Pred Value : 0.8968
## Prevalence : 0.6162
## Detection Rate : 0.5802
## Detection Prevalence : 0.6521
## Balanced Accuracy : 0.8773
##
## 'Positive' Class : 0
##
boosted_pred = predict( best_model, newdata=xTest, n.trees=best_steps, type="response");
# probabilities >= 0.5 mean survived, < 0.5 mean perished
boosted_predictions = as.numeric( boosted_pred >= threshold );
boostedModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=boosted_train_calls);
boostedModel = data.frame(PassengerId=newTest$PassengerId, Survived=boosted_predictions);
boostedModelTrainProbs = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(boosted_train_pred));
boostedModelProbs = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(boosted_pred));
write.csv( boostedModelTrainProbs, "BoostedTreeSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModelProbs, "BoostedTreeSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModelTrain, "BoostedTreeSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModel, "BoostedTreeSigModel.csv", row.names=FALSE, quote=FALSE );
# interaction.depth = 8, nTrees = 328, learningRate = 0.02
# scored 0.75119
The cross validated parameters were found to be an interaction depth of 8, 328 trees, and a learning rate of 0.02. The model achieved a score of 0.8923 on the training set and a sad 0.7368 on the test set.
Support Vector Machines seek to find the optimal separating hyperplane between two classes. There are several parameters to consider: cost (of a constraint violation), choice of kernel (and corresponding gamma). Cross-validation is used to estimate these parameters.
library(e1071);
library(caret);
#xTrain = newTrain[,c("pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
#yTrain = newTrain[,"SurvivedFactor"];
xTrain = model.matrix( ~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=whiteTrain);
yTrain = newTrain[,"SurvivedFactor"];
#xTest = newTest[,c("pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
xTest = model.matrix( ~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=whiteTest);
#xTrainSVM = model.matrix( ~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain);
#xTestSVM = model.matrix( ~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTest);
costs = c(0.01, 0.1, 0.5, 0.9, 1, 1.1);
gammas = c(0.0001, 0.01, 0.05, 0.1, 0.5);
kernels = c("linear", "polynomial", "radial", "sigmoid");
bestCost = costs[1];
bestGamma = gammas[1];
bestKernel = kernels[1];
bestError = Inf;
errors = rep(Inf,1,length(kernels));
ctr = 1;
for ( kernel in kernels ){
tune.out = tune( "svm", xTrain, train.y=yTrain, kernel=kernel, ranges=list( cost=costs,gamma=gammas) );
tune.out$best.performance
tune.out$best.model
thisError = tune.out$best.performance;
errors[ctr] = thisError;
if ( thisError < bestError ){
bestError = thisError;
bestKernel = kernel;
bestGamma = tune.out$best.model$gamma;
bestCost = tune.out$best.model$cost;
}
ctr = ctr + 1;
}
svm.fit = svm( x=xTrain, y=yTrain, kernel=bestKernel, cost=bestCost, gamma=bestGamma);
#plot( svm.fit, xTrain );
svm.train.pred = predict( svm.fit, xTrain );
svmTrainingPreds = as.numeric( svm.train.pred ) - 1;
truth = newTrain$Survived;
confusionMatrix( svmTrainingPreds, truth );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 490 84
## 1 59 258
##
## Accuracy : 0.8395
## 95% CI : (0.8137, 0.863)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.656
## Mcnemar's Test P-Value : 0.04475
##
## Sensitivity : 0.8925
## Specificity : 0.7544
## Pos Pred Value : 0.8537
## Neg Pred Value : 0.8139
## Prevalence : 0.6162
## Detection Rate : 0.5499
## Detection Prevalence : 0.6442
## Balanced Accuracy : 0.8235
##
## 'Positive' Class : 0
##
# scores a 0.8294 on the training set
svm.test.pred = predict( svm.fit, xTest );
svmPredictions = as.numeric( svm.test.pred ) - 1;
svmModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=svmTrainingPreds);
svmModel = data.frame(PassengerId=newTest$PassengerId, Survived=svmPredictions);
write.csv( svmModelTrain, "SVMModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( svmModel, "SVMModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( svmModelTrain, "SVMModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( svmModel, "SVMModel.csv", row.names=FALSE, quote=FALSE );
# scored a 0.7727
The SVM model scored a 0.8328 against the training set and a 0.7727 against the test set.
This neural network implementation uses a single layer. The number of hidden nodes and learning rate is selected through 10-fold cross validation.
Ensemble models use multiple models to make predictions. Ensemble models usually fall into one of three different types: bagging, boosting, and stacking. In a bagging model paradigm, a single model (usually) is run multiple times over different subsamples of the data. These models are usually not complex to avoid overfitting and hopefully learn one part of the data and so are uncorrelated with one another. The output of the models are then combined to make a prediction. The Gradient Boosting model that we used with classification trees is an example of an ensemble model where successive models are employed to reduce the residuals of the previous model. Finally, a stacking ensemble model is much like a traditional machine learning model only the features or inputs to the learning model are themselves outputs of different types of machine learning models. The caveat, however, is that in order for ensembling to work, the models must be uncorrelated.
The idea in this section is to gather the results from all the trained models. Where possible, I used the probabilties of individual predictions (i.e., 0.983 chance of survival) over class calls (binary 0=dead, 1=survived) as that data is a little more fine-tuned. For this dataset, many of the models that I have trained are drastically overfit. The gradient boosted tree models, with training accuracies of 98% are extremely overfit. Other models are just not well-suited for this type of problem (KNN, LDA, QDA). Once I pared the list of models to just include the models that I thought were well-suited for this problem, I computed all of the cross-correlations between models and then enumerated all of the sets of models that were below a specified cross-correlation (I picked 0.75 to signify a high correlation). Once I did this, I found that all of the remaining models were highly correlated to one another, so I coudn’t make an ensemble! But since I’m determined to make an ensemble, I pressed on.
I added back the LDA, QDA, and neural network models and reran the correlation analysis. I purposely excluded the KNN and Bagging models since I believe KNN to not be well-suited for this problem and the bagging models are massively overfit. The correlation analysis showed which models could be combined into a single stacked model (see below). I manually selected the set of 3 models: RandomForest, Gender model, and QDA, since they represent three models of different complexity (simple: gender model, balanced: Random Forest, complex: QDA), hoping the strengths of each would complement each other.
In addition, I decided to use all the models in a different ensemble model, just taking row means. This set included all of the models, including the overfit bagging models, KNN, boosting, etc. Many of these models are highly correlated so the ensemble method is just a simple mean across all of the models in each row.
library(matlab,quietly=TRUE);
##
## Attaching package: 'matlab'
## The following object is masked from 'package:stats':
##
## reshape
## The following objects are masked from 'package:utils':
##
## find, fix
## The following object is masked from 'package:base':
##
## sum
library(ROCR,quietly=TRUE);
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(tidyr,quietly=TRUE);
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:Matrix':
##
## expand
library(dplyr,quietly=TRUE);
library(tibble,quietly=TRUE);
##
## Attaching package: 'tibble'
## The following object is masked from 'package:assertthat':
##
## has_name
all_train_models = c( "BagModelTrainProbs.csv", "BagSigModelTrainProbs.csv", "BoostedTreeModelTrainProbs.csv", "BoostedTreeSigModelTrainProbs.csv", "GenderModelTrainProbs.csv", "KNNBest6ModelTrainProbs.csv", "KNNBest7ModelTrainProbs.csv", "KNNBest8ModelTrainProbs.csv", "KNNBest9ModelTrainProbs.csv", "KNNFullModelTrainProbs.csv", "LassoRegressionFullModelTrainProbs.csv", "LassoRegressionSigModelTrainProbs.csv", "LDAModelTrainProbs.csv", "LDAReducedModelTrainProbs.csv", "LogisticRegressionFullModelTrainProbs.csv", "LogisticRegressionSigModelTrainProbs.csv", "NeuralNetworkSigModelCVTrainProbs.csv", "NeuralNetworkSigModelTrainProbs.csv", "NullModelTrainProbs.csv", "QDAModelTrainProbs.csv", "QDASigModelTrainProbs.csv", "RandomForestModelTrainProbs.csv", "RandomForestSigModelTrainProbs.csv", "RandomForestTrimmedModelTrainProbs.csv", "RidgeRegressionFullModelTrainProbs.csv", "RidgeRegressionSigModelTrainProbs.csv", "SVMModelTrainProbs.csv", "TreeModelTrainProbs.csv", "TreeSigModelTrainProbs.csv", "WCModelTrainProbs.csv" );
train_models = c( "BoostedTreeModelTrainProbs.csv", "BoostedTreeSigModelTrainProbs.csv", "GenderModelTrainProbs.csv", "LassoRegressionFullModelTrainProbs.csv", "LassoRegressionSigModelTrainProbs.csv", "LDAModelTrainProbs.csv", "LDAReducedModelTrainProbs.csv", "LogisticRegressionFullModelTrainProbs.csv", "LogisticRegressionSigModelTrainProbs.csv", "NeuralNetworkSigModelCVTrainProbs.csv", "NeuralNetworkSigModelTrainProbs.csv", "QDAModelTrainProbs.csv", "QDASigModelTrainProbs.csv", "RandomForestModelTrainProbs.csv", "RandomForestSigModelTrainProbs.csv", "RandomForestTrimmedModelTrainProbs.csv", "RidgeRegressionFullModelTrainProbs.csv", "RidgeRegressionSigModelTrainProbs.csv", "SVMModelTrainProbs.csv", "TreeModelTrainProbs.csv", "TreeSigModelTrainProbs.csv", "WCModelTrainProbs.csv" );
all_test_models = c( "BagModelProbs.csv", "BagSigModelProbs.csv", "BoostedTreeModelProbs.csv", "BoostedTreeSigModelProbs.csv", "GenderModelProbs.csv", "KNNBest6ModelProbs.csv", "KNNBest7ModelProbs.csv", "KNNBest8ModelProbs.csv", "KNNBest9ModelProbs.csv", "KNNFullModelProbs.csv", "LassoRegressionFullModelProbs.csv", "LassoRegressionSigModelProbs.csv", "LDAModelProbs.csv", "LDAReducedModelProbs.csv", "LogisticRegressionFullModelProbs.csv", "LogisticRegressionSigModelProbs.csv", "NeuralNetworkSigModelCVProbs.csv", "NeuralNetworkSigModelProbs.csv", "NullModelProbs.csv", "QDAModelProbs.csv", "QDASigModelProbs.csv", "RandomForestModelProbs.csv", "RandomForestSigModelProbs.csv", "RandomForestTrimmedModelProbs.csv", "RidgeRegressionFullModelProbs.csv", "RidgeRegressionSigModelProbs.csv", "SVMModelProbs.csv", "TreeModelProbs.csv", "TreeSigModelProbs.csv", "WCModelProbs.csv" );
test_models = c( "BoostedTreeModelProbs.csv", "BoostedTreeSigModelProbs.csv", "GenderModelProbs.csv", "LassoRegressionFullModelProbs.csv", "LassoRegressionSigModelProbs.csv", "LDAModelProbs.csv", "LDAReducedModelProbs.csv", "LogisticRegressionFullModelProbs.csv", "LogisticRegressionSigModelProbs.csv", "NeuralNetworkSigModelCVProbs.csv", "NeuralNetworkSigModelProbs.csv", "QDAModelProbs.csv", "QDASigModelProbs.csv", "RandomForestModelProbs.csv", "RandomForestSigModelProbs.csv", "RandomForestTrimmedModelProbs.csv", "RidgeRegressionFullModelProbs.csv", "RidgeRegressionSigModelProbs.csv", "SVMModelProbs.csv", "TreeModelProbs.csv", "TreeSigModelProbs.csv", "WCModelProbs.csv" );
trainModels = data.frame(PassengerId=1:nrow(newTrain), Survived=newTrain$Survived );
testModels = data.frame(PassengerId=1:nrow(newTest));
# read all the train model predictions and store in dataframe
numModels = length( train_models );
for ( i in 1:numModels ){
trainFileName = train_models[i];
fp = fileparts( trainFileName );
modelName = gsub( "TrainProbs", "", fp$name );
trainModelFrame = read.csv( trainFileName );
trainModels[ modelName ] = trainModelFrame$Survived;
# now for test
testFileName = test_models[i];
fp = fileparts( testFileName );
modelName = gsub( "Probs", "", fp$name );
testModelFrame = read.csv( testFileName );
testModels[ modelName ] = testModelFrame$Survived;
}
allTrainModels = data.frame(PassengerId=1:nrow(newTrain), Survived=newTrain$Survived );
allTestModels = data.frame(PassengerId=1:nrow(newTest));
numAllModels = length( all_train_models );
for ( i in 1:numAllModels ){
trainFileName = all_train_models[i];
fp = fileparts( trainFileName );
modelName = gsub( "TrainProbs", "", fp$name );
trainModelFrame = read.csv( trainFileName );
allTrainModels[ modelName ] = trainModelFrame$Survived;
# now for test
testFileName = all_test_models[i];
fp = fileparts( testFileName );
modelName = gsub( "Probs", "", fp$name );
testModelFrame = read.csv( testFileName );
allTestModels[ modelName ] = testModelFrame$Survived;
}
all_train_model_names = setdiff( names(allTrainModels), c("PassengerId", "Survived") );
# this calculates all correlations between models. Some corr calcs break because
# they're all one value so standard deviation calc blows up.
modelCors <- trainModels[,!names(trainModels) %in% c("PassengerId","Survived")] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'Model_A') %>%
gather(Model_B, correlation, -Model_A)
corrThresh = 0.75;
modelNames = unique(modelCors$Model_A);
# this recursive function is intended to output groups of models that
# have inter correlations below threshold
addToModelSet = function( modelSet, corrs, model, corrThresh ){
#modelSet[ length(modelSet)+1 ] = model;
if ( ! model %in% modelSet ){
modelSet = union( modelSet, model );
}
idxUsed = c();
idxUnused = 1:nrow(corrs);
toUse = c();
for ( idx in idxUnused ){
thisModel = corrs$Model_A[idx];
if ( thisModel %in% modelSet ){
toUse[ length(toUse) + 1] = idx;
}
}
for ( idx in toUse ){
addModel = TRUE;
modelA = corrs$Model_A[idx];
modelB = corrs$Model_B[idx];
correlation = corrs$correlation[idx];
if ( correlation < corrThresh ){
# now, check corr with every other member of modelSet
for ( model in modelSet ){
# want to check modelB corr with all models in modelSet
idxModelA = which( corrs$Model_A == model );
idxModelB = which( corrs$Model_B == modelB );
if ( model == modelB ){
idxUsed[ length(idxUsed) + 1 ] = idx;
# correlations of the same model
next;
}
rightRowIdx = intersect( idxModelA, idxModelB );
if( length( rightRowIdx ) == 1 && rightRowIdx < nrow(corrs) ){
rightRow = corrs[ rightRowIdx, ];
}
else{
next;
}
rightCorr = rightRow$correlation;
if ( rightCorr > corrThresh ){
idxUsed[ length(idxUsed) + 1 ] = idx;
addModel = FALSE;
break;
}
}
idxUsed[ length(idxUsed) + 1 ] = idx;
}
else{
addModel = FALSE;
}
if ( addModel ){
idxToUse = setdiff( idxUnused, idxUsed );
newCorrs = corrs[idxToUse,]
#newCorrs = corrs;
modelSet = addToModelSet( modelSet, newCorrs, modelB, corrThresh);
}
}
return( modelSet );
}
modelSets = list();
print( "Model sets that have inter-correlations less than threshold:")
## [1] "Model sets that have inter-correlations less than threshold:"
for ( model in modelNames ){
modelSet = addToModelSet( c(), modelCors, model, corrThresh );
cat( paste(paste(modelSet, collapse=", "),"\n"));
modelSets[ length(modelSets) + 1 ] = list( modelSet );
}
## BoostedTreeModel, GenderModel
## BoostedTreeSigModel, GenderModel
## GenderModel, BoostedTreeModel
## LassoRegressionFullModel
## LassoRegressionSigModel
## LDAModel, NeuralNetworkSigModel
## LDAReducedModel, NeuralNetworkSigModel
## LogisticRegressionFullModel
## LogisticRegressionSigModel
## NeuralNetworkSigModelCV, GenderModel, QDASigModel
## NeuralNetworkSigModel, GenderModel, QDAModel
## QDAModel, GenderModel, NeuralNetworkSigModel
## QDASigModel, GenderModel, NeuralNetworkSigModelCV
## RandomForestModel, GenderModel, QDASigModel
## RandomForestSigModel
## RandomForestTrimmedModel
## RidgeRegressionSigModel
## SVMModel, NeuralNetworkSigModel
## TreeModel
## TreeSigModel
## WCModel, NeuralNetworkSigModel, QDAModel
# now go through each model and find the other models which are correlated below
# threshold (0.75)
# manually choose the set #14
uncorrelatedModelSet = modelSets[[14]];
# now for all models that have passed criteria, create a model
selectedTrainModels = trainModels[,c("Survived", uncorrelatedModelSet)];
ensemble_lr = glm( Survived~., data=selectedTrainModels, family=binomial);
summary( ensemble_lr );
##
## Call:
## glm(formula = Survived ~ ., family = binomial, data = selectedTrainModels)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2024 -0.5514 -0.5514 0.4304 1.9793
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8411 0.1749 -10.524 < 2e-16 ***
## RandomForestModel 3.3197 0.3210 10.341 < 2e-16 ***
## GenderModel 0.1817 0.5384 0.337 0.73578
## QDASigModel 0.7193 0.2544 2.828 0.00469 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 712.48 on 887 degrees of freedom
## AIC: 720.48
##
## Number of Fisher Scoring iterations: 4
trainSetProbs = predict( ensemble_lr, trainModels, type="response"); # now a vector of probabilities
# probabilities >= 0.5 mean survived, < 0.5 mean perished
threshold = 0.5;
selectedTrainPredictions = as.numeric( trainSetProbs >= threshold );
confusionMatrix( selectedTrainPredictions, trainModels$Survived );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 524 102
## 1 25 240
##
## Accuracy : 0.8575
## 95% CI : (0.8328, 0.8798)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6853
## Mcnemar's Test P-Value : 1.542e-11
##
## Sensitivity : 0.9545
## Specificity : 0.7018
## Pos Pred Value : 0.8371
## Neg Pred Value : 0.9057
## Prevalence : 0.6162
## Detection Rate : 0.5881
## Detection Prevalence : 0.7026
## Balanced Accuracy : 0.8281
##
## 'Positive' Class : 0
##
# check to see if threshold 0.5 is appropriate
rocr.pred = prediction( trainSetProbs, trainModels$Survived);
acc.perf = performance( rocr.pred, measure="acc");
plot( acc.perf, main="Manual Ensemble Logistic Regression Model Training Set Accuracy by Threshold" );
# threshold 0.5 is appropriate
# now, how'd we do against the test set?
testSetProbs = predict( ensemble_lr, testModels, type="response");
testPredictions = as.numeric( testSetProbs >= threshold );
ensembleModelManual = data.frame(PassengerId=newTest$PassengerId, Survived=testPredictions);
write.csv( ensembleModelManual, "EnsembleModelManual.csv", row.names=FALSE, quote=FALSE );
# now try just an ensemble model, taking means
ensembleTrainModels = allTrainModels[,all_train_model_names];
# simply take the means across the rows to get "probabilities" of survival for each passenger
ensembleTrainPredictions = rowMeans( ensembleTrainModels );
ensembleTrainCalls = round( ensembleTrainPredictions );
confusionMatrix( ensembleTrainCalls, allTrainModels$Survived );
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 509 79
## 1 40 263
##
## Accuracy : 0.8664
## 95% CI : (0.8423, 0.8881)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7114
## Mcnemar's Test P-Value : 0.000495
##
## Sensitivity : 0.9271
## Specificity : 0.7690
## Pos Pred Value : 0.8656
## Neg Pred Value : 0.8680
## Prevalence : 0.6162
## Detection Rate : 0.5713
## Detection Prevalence : 0.6599
## Balanced Accuracy : 0.8481
##
## 'Positive' Class : 0
##
# check to see if threshold 0.5 is appropriate
rocr.pred = prediction( ensembleTrainPredictions, allTrainModels$Survived);
acc.perf = performance( rocr.pred, measure="acc");
plot( acc.perf, main="All Means Ensemble Logistic Regression Model Training Set Accuracy by Threshold" );
# threshold 0.5 is appropriate
# How'd the ensemble model do against the test set?
ensembleTestModels = testModels[,modelNames];
# simply take the means across the rows to get "probabilities" of survival for each passenger
ensembleTestPredictions = rowMeans( ensembleTestModels );
ensembleTestCalls = round( ensembleTestPredictions );
#fullLogisticRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=predictions);
#fullLogisticRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=fullLogisticPredictions);
# now, create a new test and train data frame with these models as predictors
# write to file
ensembleModelMeans = data.frame(PassengerId=newTest$PassengerId, Survived=ensembleTestCalls);
write.csv( ensembleModelMeans, "EnsembleModelMeans.csv", row.names=FALSE, quote=FALSE );
The manually selected linear regression ensemble model was constructed using three predictors: Random Forest, Gender model, and QDA. The training set score was 0.8575 and test set score was 0.7847. The simple row means model scored a 0.8664 against the training set and a 0.7703 against the test set. As would be expected, the row means model, which used some very overfit models, was itself overfit - scoring high against the training set and low against the test set - lower than the manually selected ensemble model.
Cheaters never prosper. Nevertheless.
All of the Titanic data is available online. It is possible to find resource pages listing who died and who survived. Of course, much of it is not in a standardized format and has to be screenscraped. In addition, there are errors and discrepancies both in the Kaggle data set and in the data online. Word for word searches fail in many cases.
library(RCurl);
library(XML);
library(readr);
#testIDs = 892:nrow(data_combined);
#userAgents = c( "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0", "Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36", "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1", "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)");
userAgents = c( "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063" );
numUserAgents = length( userAgents );
trainIDs = 1:891;
testIDs = 892:nrow(data_combined);
setIDs = 1:nrow(data_combined);
trainURLS = c();
setHits = data_combined[,c("PassengerId","Survived")];
setHits$SurvivorHits = 0;
setHits$VictimHits = 0;
setHits$SurvivorFirst = 0;
setHits$SurvivedPrediction = 0;
for ( pass_idx in 1:length(setIDs) ){
i = setIDs[pass_idx];
name = gsub( " ", "+", as.character( data_combined[i,"NameString"] ) ); # remove whitespace
sex = as.character( data_combined[i,"Sex"] );
searchURL = paste0( "https://www.google.com/search?q=", name, "+", sex, "+titanic" );
trainURLS[i] = searchURL;
randomUserAgent = userAgents[ round( runif(1, 1, numUserAgents) ) ];
wd = getwd();
localFile = paste0( wd, "/html/", i, ".html" );
if ( file.exists( localFile )){
resultsHTML = read_file( localFile );
}else{
resultsHTML = getURL( searchURL, httpheader=c('User-Agent'=randomUserAgent) );
write( resultsHTML, localFile );
Sys.sleep( runif( 1, 27, 67)); # randomly sleep between 27 and 67 seconds
# this is necessary because if Google thinks you're mining their data with a bot
# you will be grounded and sent to your room for a timeout.
}
# now search for "victim" and "survivor"
victimHits = gregexpr("victim", resultsHTML, ignore.case=TRUE);
victimHitIndices = which( victimHits[[1]] > -1 );
numVictimHits = length( victimHitIndices );
setHits[pass_idx,"VictimHits"] = numVictimHits;
firstVictimHit = victimHits[[1]][1];
survivorHits = gregexpr("survivor", resultsHTML, ignore.case=TRUE);
survivorHitIndices = which( survivorHits[[1]] > -1 );
numSurvivorHits = length( survivorHitIndices );
setHits[pass_idx,"SurvivorHits"] = numSurvivorHits;
firstSurvivorHit = survivorHits[[1]][1];
#cat( "row ", i, "\n")
#assert_that( ( numSurvivorHits > 0 ) || (numVictimHits > 0) );
if (!( ( numSurvivorHits > 0 ) || (numVictimHits > 0) ) ){
cat( "Could not find 'victim' or 'survivor' keyword for PassengerID ", i, ". Manually investigate and update .html file with 'survived' or 'victim'.\n")
}
survived = FALSE;
if ( numSurvivorHits > numVictimHits ){
survived = TRUE;
}
else if ( ( numSurvivorHits > 0 ) && ( numVictimHits > 0 ) ){
# both are positive, take the min
if ( firstSurvivorHit < firstVictimHit ){
survived = TRUE;
}
else{
survived = FALSE;
}
setHits[pass_idx,"SurvivorFirst"] = as.numeric( survived );
}
survived_flag = 0;
if( survived ){
survived_flag = 1;
}
setHits[pass_idx,"SurvivedPrediction"] = survived_flag;
if ( !is.na( data_combined[i,"Survived"] ) && survived_flag != data_combined[i,"Survived"] ){
#cat( "Prediction does not match truth at ", i, as.character( data_combined[i,"Name"] ), "\n");
}
}
# evaluate training set performance
trainingPredictions = setHits$SurvivedPrediction[1:891];
confusionMatrix( trainingPredictions, data_combined$Survived[1:891])
cheatTestPredictions = setHits$SurvivedPrediction[892:nrow(data_combined)];
write.csv( setHits, file="CheatModel1.csv", quote=FALSE);
A webpage exists that lists all the survivors in bold and all the passengers who died in normal font. It is possible to build a scraper to parse the passengers into two lists and then try to match the names in the Kaggle test set with the names in the Survived and Died sets derived from the webpage. The idea is to use both a bag-of-words model and a distance model to capture both number of token hits and handle any differences in spelling.
# read in CheatModel1
cheat <- read.csv("CheatModel1.csv", header = TRUE);
cheat$SurvivedFactor = as.factor( cheat$Survived );
cheat$SurvivedTokenMatches = 0;
cheat$DiedTokenMatches = 0;
cheat$SurvivedTokenDistance = 0;
cheat$DiedTokenDistance = 0;
resourceURL = "http://www.titanic-whitestarships.com/1st_Class_Pass.htm";
# Example HTML:
#<strong>Gordon, Sir Cosmo Duff<br>
#Gordon, Lady Lucile Duff<br>
#and Maid (Miss Laura Mabel Francatelli)<br>
#Gracie, Colonel Archibald IV</strong><br>
#Graham, Mr. George Edward<br>
#Graham1, Mr. George Edward<br>
#Graham2, Mr. George Edward<br>
#Graham3, Mr. George Edward<br>
#<strong>Graham, Mrs. William Thompson<br>
#(nee Edith Junkins)<br>
#Graham, Miss Margaret<br>
#Greenfield, Mrs. Leo David<br>
#(nee Blanche Strouse)<br>
#Greenfield, Mr. William Bertram</strong><br>
#Guggenheim, Mr. Benjamin<br>
#and Manservant (Victor Giglio)</font></p>
cheatSurvived = numeric(nrow(newTest));
wd = getwd();
localFile = file.path( wd, "TitanicResource.htm" );
# know the file is approx 2100 lines. Initialize each died/survived list to 3000
died = character(3000);
survived = character(3000);
if ( !file.exists( localFile )){ # Do this so we can load the URL once and not spam it everytime we run this script
resultsHTML = getURL( resourceURL );
write( resultsHTML, localFile );
}
# now, parse the file, line-by-line. Survivors are in bold, denoted by the <strong> tag:
fid = file( localFile, "r");
isStrong = FALSE;
diedCtr = 1;
survivedCtr = 1;
while( TRUE ){
line = readLines(fid,n=1);
if( length(line) == 0 ){
break;
}
# look for </strong>
thisLineStartStrong = grepl( "^<strong>", line );
thisLineEndStrong = grepl( "</strong>", line );
# if line contains <strong
# strip html tags
rawText = gsub( "<.*?>", "", line );
# skip if no comma
thisLineHasComma = grepl( ",", rawText );
if ( !thisLineHasComma ){
next;
}
# remove punctuation
rawText = gsub( "[[:punct:]]", "", rawText );
# skip if line (name) is too long
if ( nchar( rawText) > 100 ){
next;
}
if( isStrong || thisLineStartStrong ){
survived[survivedCtr] = rawText;
survivedCtr = survivedCtr + 1;
}
else{
died[diedCtr] = rawText;
diedCtr = diedCtr + 1;
}
isStrong = thisLineStartStrong && !thisLineEndStrong;
}
close(fid);
# truncate the died/survived lists
survived = sort( survived[1:survivedCtr-1] );
died = sort( died[1:diedCtr-1] ); # yes, this includes crap lines at the top of list
#
# now have two lists and need to do bag of words searching vs. each name
# for each name, tokenize, then query the died/survived lists
for( i in 1:nrow( newTest ) ){
thisName = newTest[i,"Name"];
# remove non-word characters
thisName = gsub( "[[:punct:]]", "", thisName );
# now tokenize each name and search each died/survived for distance from each token
nameTokens = strsplit( thisName, "\\s+")[[1]];
numTokens = length( nameTokens );
# search through survived
survivedTokenMatches = numeric(length(survived));
survivedTokenDistance = numeric(length(survived));
for ( j in 1:length( survived ) ){
thisScore = 0;
thisSurvived = survived[j];
survivedTokens = strsplit( thisSurvived, "\\s+")[[1]];
numSurvivedTokens = length( survivedTokens );
# now, take each name token and search for match in survived tokens
for ( k in 1:numTokens){
thisNameToken = nameTokens[k];
if ( thisNameToken %in% survivedTokens ){
thisScore = thisScore + 1;
}
}
survivedTokenMatches[j] = thisScore;
# go through each name token and find the min distance to survived tokens
totalDist = 0;
for ( k in 1:numTokens){
thisNameToken = nameTokens[k];
wordDist = 10;
distances = numeric(numSurvivedTokens);
for ( m in 1:numSurvivedTokens){
thisSurvivedToken = survivedTokens[m];
distances[m] = adist( thisNameToken, thisSurvivedToken );
}
totalDist = totalDist + min( distances );
}
survivedTokenDistance[j] = totalDist;
}
bestSurvivedScore = max( survivedTokenMatches );
bestSurvivedDistance = min( survivedTokenDistance );
cheat[i,"SurvivedTokenHits"] = bestSurvivedScore;
cheat[i,"SurvivedTokenDistance"] = bestSurvivedDistance;
# search through died
diedTokenMatches = numeric(length(died));
diedTokenDistance = numeric(length(died));
for ( j in 1:length( died ) ){
thisScore = 0;
thisDied = died[j];
diedTokens = strsplit( thisDied, "\\s+")[[1]];
numDiedTokens = length( diedTokens );
# now, take each name token and search for match in survived tokens
for ( k in 1:numTokens){
thisNameToken = nameTokens[k];
if ( thisNameToken %in% diedTokens ){
thisScore = thisScore + 1;
}
}
diedTokenMatches[j] = thisScore;
# go through each name token and find the min distance to survived tokens
totalDist = 0;
for ( k in 1:numTokens){
thisNameToken = nameTokens[k];
wordDist = 10;
distances = numeric(numDiedTokens);
for ( m in 1:numDiedTokens){
thisDiedToken = diedTokens[m];
distances[m] = adist( thisNameToken, thisDiedToken );
}
totalDist = totalDist + min( distances );
}
diedTokenDistance[j] = totalDist;
}
bestDiedScore = max( diedTokenMatches );
bestDiedDistance = min( diedTokenDistance );
cheat[i,"DiedTokenHits"] = bestSurvivedScore;
cheat[i,"DiedTokenDistance"] = bestSurvivedDistance;
#cat( thisName, " bestScore = ", bestScore, ", bestMatch = ", survived[bestIdx], "\n");
}
# now that we have some cheating scores, build a simple tree model
library(tree);
cheat_train = cheat[1:891,]; # use the training data
cheat_test= cheat[892:nrow(cheat),]; # use the test data
tree_cheat_model = tree( SurvivedFactor~SurvivorHits+VictimHits+SurvivorFirst+SurvivedTokenHits+SurvivedTokenDistance+DiedTokenHits+DiedTokenDistance,data=cheat_train );
summary( tree_cheat_model )
plot( tree_cheat_model )
text( tree_cheat_model, pretty=0);
cheat_train_pred = predict( tree_cheat_model, cheat_train, type="class");
cheat_test_pred = predict( tree_cheat_model, cheat_test, type="class");
# performance
confusionMatrix(cheat_train_pred, cheat_train$Survived);
cheatModelTest = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(cheat_test_pred)-1);
write.csv( cheatModelTest, "CheatModelFull.csv", row.names=FALSE, quote=FALSE );
Here is the summary of training and test results for the different models:
| Model | CVMethod | TrainingScore | TestScore |
|---|---|---|---|
| BagModel | - | 0.9876543 | 0.7320574 |
| BagSigModel | - | 0.9854097 | 0.7272727 |
| NeuralNetworkFullModel | - | 0.9203143 | 0.6746411 |
| NeuralNetworkSigModel | - | 0.9124579 | 0.7105263 |
| BoostedTreeModel | 8-fold CV | 0.8933782 | 0.7559809 |
| BoostedTreeSigModel | 8-fold CV | 0.8922559 | 0.7368421 |
| KNNBest8Model | LOOCV | 0.8787879 | 0.7559809 |
| KNNBest6Model | LOOCV | 0.8754209 | 0.7200957 |
| KNNBest7Model | LOOCV | 0.8754209 | 0.7344498 |
| KNNBest9Model | LOOCV | 0.8742985 | 0.7631579 |
| BagSigModel | - | 0.8664000 | 0.7272727 |
| RandomForestSigModel | - | 0.8608305 | 0.7822967 |
| BagModel | - | 0.8575000 | 0.7320574 |
| RandomForestModel | - | 0.8574635 | 0.7846890 |
| KNNFullModel | LOOCV | 0.8563412 | 0.7607656 |
| TreeSigModel | Pruning | 0.8540965 | 0.7607656 |
| NeuralNetworkSigModelCV | 10-fold CV | 0.8529742 | 0.7751196 |
| TreeModel | Pruning | 0.8507295 | 0.7727273 |
| RandomForestTrimmedModel | - | 0.8484848 | 0.7775120 |
| NeuralNetworkFullModelCV | 10-fold CV | 0.8417508 | 0.7846890 |
| LogisticRegressionSigModel | - | 0.8395062 | 0.7607656 |
| SVMModel | 10-fold CV | 0.8395062 | 0.7679426 |
| LassoRegressionSigModel | 10-fold CV | 0.8372615 | 0.7775120 |
| LDAModel | - | 0.8372615 | 0.7727273 |
| LassoRegressionFullModel | 10-fold CV | 0.8361392 | 0.7703349 |
| LDAReducedModel | - | 0.8338945 | 0.7727273 |
| LogisticRegressionFullModel | - | 0.8338945 | 0.7607656 |
| RidgeRegressionSigModel | 10-fold CV | 0.8249158 | 0.7870813 |
| RidgeRegressionFullModel | 10-fold CV | 0.8226712 | 0.7846890 |
| QDAModel | - | 0.8159371 | 0.7296651 |
| WomenAndChildrenFirstModel | - | 0.7901235 | 0.7511962 |
| gender_submission | - | 0.7867565 | 0.7631579 |
| QDASigModel | - | 0.7845118 | 0.7272727 |
| NullModel | - | 0.6161616 | 0.6244019 |
Now then, how does one go about picking a model out of all of these? Clearly, the training scores of some of the more complex models cannot be trusted as they’re obviously overfit. The bagged tree model has training accuracies of about 0.98! But how would we know that they’re overfit without peeking at the test scores? The boosted tree cross-validation accuracies were about 0.89 - but these also proved to be vastly overfit.
But suppose all of these training results were more or less equal - some at 0.84 cross-validated training set accuracy, some at 0.83 cross-validated training set accuracy, and one at 0.85 cross-validated training set accuracy. What then would be the criterion for choosing a model? At first blush, it might be obvious to choose the one that that has the highest cross-validated training set accuracy of 0.85. But what if that model is very complex with a lot of parameterization and there is a much simpler model that scored at 0.84. Then would it be better to choose the simpler, but lower-scoring model? Is the difference between 0.84 and 0.85 statistically significant or is the difference just due to random chance? Is there a way to quantify the statistical significance? I’ve seen McNemar’s test but it looks like it operates on the correct counts between only two different models, not when num_models > 2. And if we run McNemar’s test over all pairs of models, then we’re likely to see something “significant” just through chance and need to somehow correct for it.
If I had to pick the “best” model out of these, it would probably be the ridge regression model. The squared regularization penalty helps to drive down the coefficients of some of the predictors better than the similar lasso regression model.
This report is a little stream-of-consciousness as it is/was my first Kaggle competition and kernel. But what a great introduction to overfitting with R. The EDA portion was really enjoyable as was the feature engineering section. I thought that time put in to the early part would pay off later on but in retrospect, I probably should have tried to regularize the data a little more with outlier detection and deletion. But I was reluctant to delete any data that might help me eek out a few more percentage points!
Also, in retrospect, it is much easier to run all of the models through the caret framework. In this model, a single feature set can be run through different models fairly seemlessly - and then even run as ensembles at a later step.
Thanks for reading along!