About Kaggle

Kaggle is a site that hosts data science competitions. Individuals and teams compete for prizes to solve problems using data science and machine learning. There are currently 16 active competitions and over 250 completed competitions.

The Problem Statement

On April 14, 1912, the RMS Titanic struck an iceberg in the North Atlantic Ocean and sank. Of the 2,224 people on board, only 706 survived.

The goal of this exercise is to predict survivors on the Titanic based on nine input variables, described below. We are provided two datasets: (1) train.csv, containing 891 records and (2) test.csv, containing 418 records. The two datasets are provided with the intent that models are formulated using the train dataset and model performance is evaluated on the test dataset.


About the Data

Variable Definition Key
survival Survival 0 = No, 1 = Yes
pclass Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
sex Sex
Age Age in year
sibsp # of siblings / spouses aboard the Titanic
parch # of parents / children aboard the Titanic
ticket Ticket number
fare Passenger fare
cabin Cabin number
embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

sibsp: The dataset defines family relations in this way… Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way… Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.

source: https://www.kaggle.com/c/titanic/data

Sample Records

Training Data

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 S
6 0 3 Moran, Mr. James male NA 0 0 330877 8.4583 Q

Test Data

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 Q
893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 S
894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 Q
895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 S
896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 S
897 3 Svensson, Mr. Johan Cervin male 14.0 0 0 7538 9.2250 S

Exploratory Data Analysis

Missing Values

The first step is to find any and all missing data in the train and test sets.

Train Dataset

# go through each variable and if it's an empty factor or numeric NA, sum each column
nlTrainNA = sapply( train, function(x) switch( class(x), factor = sum(x==""), sum( is.na(x) ) ) );
tTrainNA = t(nlTrainNA);                                 # transpose named list
dfTrainNA = data.frame( tTrainNA );                      # convert to dataframe

The train.csv dataset had 3 columns with missing values: Age, Cabin, and Embarked. Age is likely to be an important predictor of survival and we have data for 80% of the training subjects so imputing the missing values is likely to be beneficial. The source of embarkation may not have obvious predictive power, but given that we have data for over 99% of the training subjects, imputing the missing values could provide value. The Cabin variable is missing from over 77% of the test subjects. At first impression, this variable seems like an unlikely candidate to impute values since so much source data is missing.

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 0 0 0 177 0 0 0 0 687 2

Test Dataset

nlTestNA = sapply( test, function(x) switch( class(x), factor = sum(x==""), sum( is.na(x) ) ) );
tTestNA = t(nlTestNA);                                 # transpose named list
dfTestNA = data.frame( tTestNA );                      # convert to dataframe

The test.csv dataset also had 3 columns with missing values: Age, Fare, and Cabin. Like the train dataset, the Cabin variable is sparse, with over 78% subjects missing values. The Age variable is populated for over 79% of the train subjects, and likely has good predictive power so it will likely be beneficial to impute values for subjects missing Age. The Fare variable is missing for a single subject. While Fare may not be an obvious predictor for survival, the fact that the dataset is over 99% complete for this variable indicates that it is a good candidate for imputation.

PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 0 0 86 0 0 0 1 327 0

Variable Features

PassengerId

PassengerId is a primary key for each row of data in the train and test sets. This variable will not be included in any of the predictive models.

Survived

Survived is the class we’re trying to predict.

Pclass

Passenger class is either 1st, 2nd, or 3rd.

ggplot(train, aes(x = factor(Pclass), fill = factor(Pclass))) +
  geom_bar( show.legend=FALSE) +
  xlab("Pclass") +
  ylab("Total Count")

The Pclass variable shows that most passengers in the train set held a 3rd class ticket. 491 of the 891 passengers were 3rd class, more than 1st and 2nd class combined.

ggplot(train, aes(x = factor(Pclass), fill = factor(Survived))) +
  geom_bar(width = 0.5, position="dodge") +
  xlab("Pclass") +
  ylab("Total Count") +
  labs(fill = "Survived")

If the Survived variable is plotted as a function of passenger class, it appears that Pclass will be a predictor for survivability. A higher percentage of first class passengers survived than died, contrary to the overall trend, whereas a far higher percentage of 3rd class passengers died than survived.

Name

It may seem that the passenger name is a lot like the PassengerId in that each name acts as a sort of primary key into the data and using name as a model feature would not generalize well. However, the name field could possibly provide value. The first twenty names in the train dataset:

head(as.character(train$Name),n=20);
##  [1] "Braund, Mr. Owen Harris"                                
##  [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"    
##  [3] "Heikkinen, Miss. Laina"                                 
##  [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"           
##  [5] "Allen, Mr. William Henry"                               
##  [6] "Moran, Mr. James"                                       
##  [7] "McCarthy, Mr. Timothy J"                                
##  [8] "Palsson, Master. Gosta Leonard"                         
##  [9] "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"      
## [10] "Nasser, Mrs. Nicholas (Adele Achem)"                    
## [11] "Sandstrom, Miss. Marguerite Rut"                        
## [12] "Bonnell, Miss. Elizabeth"                               
## [13] "Saundercock, Mr. William Henry"                         
## [14] "Andersson, Mr. Anders Johan"                            
## [15] "Vestrom, Miss. Hulda Amanda Adolfina"                   
## [16] "Hewlett, Mrs. (Mary D Kingcome) "                       
## [17] "Rice, Master. Eugene"                                   
## [18] "Williams, Mr. Charles Eugene"                           
## [19] "Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)"
## [20] "Masselmani, Mrs. Fatima"

Each name begins with a surname before a comma and a title. If the passenger is a married woman, her maiden name appears in parentheses. Some of the titles may help infer age (Master. and Miss.) and surnames could help determine extended family travelling together, even if they’ve purchased separate tickets and are not in the same cabin. Additionally, the presence of diacritical marks in a name could indicate that the passenger is a non-English speaker who might have had difficulty understanding instructions or the gravity of the situation.

Sex

Sex is an unordered factor, male or female.

ggplot(train, aes(x = factor(Sex), fill = factor(Sex))) +
  geom_bar( show.legend=FALSE) +
  xlab("Sex") +
  ylab("Total Count")

The Sex variable shows that most passengers in the training set were male (nearly 2/3 male). 577 of the 891 passengers were male, or 65%.

ggplot(train, aes(x = factor(Sex), fill = factor(Survived))) +
  geom_bar(width = 0.5, position="dodge") +
  xlab("Sex") +
  ylab("Total Count") +
  labs(fill = "Survived")

If the Survived variable is plotted as a function of Sex, it appears that Sex will be a strong predictor for survivability. Of the 314 female passengers in the training set, 233, or 74% survived. On the other hand, of the 577 male passengers, only 109 survived, or 19%.

Age

As noted, there are age values for 80% of the training subjects, missing for 177 passengers. The distribution of ages is slightly skewed right, with a median of 28 years and a mean of 29.7 years.

ggplot(subset(train, !is.na(Age)), aes(x = Age)) +
  geom_histogram(binwidth=4) +
  xlab("Age") +
  ylab("Total Count");

ggplot(subset(train,!is.na(Age)), aes(y=Age,x="")) + geom_boxplot();

Age as a predictor of survivability:

ggplot(subset(train,!is.na(Age)), aes(x = Age, fill = factor(Survived))) +
  geom_density( position="stack") +
  xlab("Age") +
  ylab("Total Count") +
  labs(fill = "Survived")

If we plot the Survived value as a function of the Age density, we see that there is a higher likelihood of younger passengers surviving over older passengers. Up until the mid-to-late teens, a training set passenger is more likely to survive than die, so age is likely to be a useful predictor for survivability.

SibSp

This variable is unique in that it is combination of number of siblings or “1” if the passenger had a spouse on board the ship. In some cases, it is not clear what the SibSp variable is encoding when a “1” is found - is the passenger travelling with a sibling or a spouse? A SibSp of 2 or more is indicative of siblings. It will likely be beneficial to disambiguate this variable into separate ‘Siblings’ and ‘Spouses’ variables.

ggplot(train, aes(x = SibSp )) +
   geom_histogram(binwidth=0.5) +
   xlab("Number of Siblings/Spouses") +
   ylab("Total Count")

Most passengers in the training set (68%) had neither a spouse or sibling on board. 23% of the training set passengers had a single sibling or spouse on board. The remaining 9% of the passengers had two or more (presumably) siblings on board.

ggplot(train, aes(x = SibSp, fill = factor(Survived))) +
   geom_histogram(binwidth=0.5, position="dodge") +
   xlab("Number of Siblings/Spouses") +
   ylab("Total Count") +
   labs(fill = "Survived")

Survivability does appear to trend with the number of siblings/spouses on board. A passenger having no siblings or spouses is most likely to have died, whereas a passenger with one or two siblings/spouses has around a 50% likelihood of surviving. The remaining cases of three to eight siblings are likely too few from which to draw inferences individually, so it might make sense to pool the SibSp values as follows: 0, 1, 2, 3> to avoid overfitting to specific training cases. A close examination of the seven instances of the SibSp variable in which SibSp equals 8 reveals that all the subjects were from the same family and were in the same cabin. Predicting that all families of size 8 will perish is unlikley to generalize well.

Parch

Similar to SibSp, this variable convolves two separate pieces of data: the number of parents and the number of children this passenger has on board. In some cases, it is not clear what the Parch variable is encoding when a “1” is found - is the passenger travelling with a parent or a child? This can be inferred if the Age variable is present for the passenger, but if the Age is missing, it will be ambiguous and may need further analysis. Perhaps the passenger’s title (Mr., Miss., Master) could help. Like SibSp, it will likely be beneficial to disambiguate this variable into separate ‘Parents’ and ‘Children’ variables.

ggplot(train, aes(x = Parch )) +
   geom_histogram(binwidth=0.5) +
   xlab("Number of Parents/Children") +
   ylab("Total Count")

Most passengers in the training set (76%) had neither a parent or child on board. 13% of the training set passengers had a single parent or child on board. About 9% of the passengers in the training set (80) had two or more parents or children on board. The remaining 2% of the passengers had either 3, 4, 5, or 6 (presumably) children on board.

ggplot(train, aes(x = Parch, fill = factor(Survived))) +
   geom_histogram(binwidth=0.5, position="dodge") +
   xlab("Number of Parents/Children") +
   ylab("Total Count") +
   labs(fill = "Survived")

Survivability does appear to trend with the number of parents/children on board. A passenger having no parents or children is most likely to have died, whereas a passenger with one or two parents/children has around a 50% likelihood of surviving. The remaining cases of three to six children are likely too few from which to draw inferences individually, so it might make sense to group the Parch values as follows: 0, 1, 2, 3>= to avoid overfitting to specific training cases.

Ticket

The entries in the Ticket column do not seem to be of a uniform format. Some ticket entries are just numbers - ranging from 693-392096. Other ticket entries have character prefixes like “C.A.” or “SOTON/O2”, followed by a (presumably) ticket number.

head(as.character(train$Ticket),n=20);
##  [1] "A/5 21171"        "PC 17599"         "STON/O2. 3101282"
##  [4] "113803"           "373450"           "330877"          
##  [7] "17463"            "349909"           "347742"          
## [10] "237736"           "PP 9549"          "113783"          
## [13] "A/5. 2151"        "347082"           "350406"          
## [16] "248706"           "382652"           "244373"          
## [19] "345763"           "2649"

An inspection of the set of tickets shows that presumed families tend to share a single ticket number. The first impression is that the ticket number seems an unlikely predictor for survivability and could lead to overfitting the training set. However, the ticket number might help populate the missing Cabin information - and Cabin might be a good predictor for survivability.

Fare

The fare (price paid per ticket) ranges from 0 to 512.3292. The units are unclear, but are likely in English pounds. The distribution is skewed to the right, with a median of 14.4542 and a mean of 32.20421. A log transform of the data may be necessary to normalize the distribution of fares. However, first the fare for passenger must be determined. It appears to be the case that individual ticket numbers are not assigned per passenger, but rather a single ticket number is given to the purchaser of an allotment of tickets. That is, families travelling together seem to be under the same ticket with the same fare. So, it may be necessary to get to create an “Amount Paid per Passenger” feature that takes into account the number of people for which a fare was purchased on a single ticket.

ggplot(train, aes(x = Fare)) +
  geom_histogram(binwidth=4) +
  xlab("Fare") +
  ylab("Total Count");

ggplot(train, aes(y=Fare,x="")) + geom_boxplot();

Fare could conceivably be an important factor in determining survivability. Perhaps the higher paying passengers received the first opportunity to board lifeboats. Or perhaps, those higher paying passengers were more initially unwilling to leave their more comfortable accomodations for the plebian conditions aboard a lifeboat. Fare as a predictor of survivability:

ggplot(train, aes(x = Fare, fill = factor(Survived))) +
  geom_density( position="stack") +
  xlab("Fare") +
  ylab("Total Count") +
  labs(fill = "Survived")

Cabin

The Cabin feature could be another strong predictor for survivability. Perhaps cabins located nearest the lifeboats afforded the best survivability. But, the Cabin variable has many empty values. The empty values could mean that the information was not captured or it could mean that not all passengers received cabins and stayed in other accomodations. Being assigned a cabin could be a proxy for one’s social status and wealth. If so, the Pclass variable might be co-linear.

levels(train$Cabin);
##   [1] ""                "A10"             "A14"            
##   [4] "A16"             "A19"             "A20"            
##   [7] "A23"             "A24"             "A26"            
##  [10] "A31"             "A32"             "A34"            
##  [13] "A36"             "A5"              "A6"             
##  [16] "A7"              "B101"            "B102"           
##  [19] "B18"             "B19"             "B20"            
##  [22] "B22"             "B28"             "B3"             
##  [25] "B30"             "B35"             "B37"            
##  [28] "B38"             "B39"             "B4"             
##  [31] "B41"             "B42"             "B49"            
##  [34] "B5"              "B50"             "B51 B53 B55"    
##  [37] "B57 B59 B63 B66" "B58 B60"         "B69"            
##  [40] "B71"             "B73"             "B77"            
##  [43] "B78"             "B79"             "B80"            
##  [46] "B82 B84"         "B86"             "B94"            
##  [49] "B96 B98"         "C101"            "C103"           
##  [52] "C104"            "C106"            "C110"           
##  [55] "C111"            "C118"            "C123"           
##  [58] "C124"            "C125"            "C126"           
##  [61] "C128"            "C148"            "C2"             
##  [64] "C22 C26"         "C23 C25 C27"     "C30"            
##  [67] "C32"             "C45"             "C46"            
##  [70] "C47"             "C49"             "C50"            
##  [73] "C52"             "C54"             "C62 C64"        
##  [76] "C65"             "C68"             "C7"             
##  [79] "C70"             "C78"             "C82"            
##  [82] "C83"             "C85"             "C86"            
##  [85] "C87"             "C90"             "C91"            
##  [88] "C92"             "C93"             "C95"            
##  [91] "C99"             "D"               "D10 D12"        
##  [94] "D11"             "D15"             "D17"            
##  [97] "D19"             "D20"             "D21"            
## [100] "D26"             "D28"             "D30"            
## [103] "D33"             "D35"             "D36"            
## [106] "D37"             "D45"             "D46"            
## [109] "D47"             "D48"             "D49"            
## [112] "D50"             "D56"             "D6"             
## [115] "D7"              "D9"              "E10"            
## [118] "E101"            "E12"             "E121"           
## [121] "E17"             "E24"             "E25"            
## [124] "E31"             "E33"             "E34"            
## [127] "E36"             "E38"             "E40"            
## [130] "E44"             "E46"             "E49"            
## [133] "E50"             "E58"             "E63"            
## [136] "E67"             "E68"             "E77"            
## [139] "E8"              "F E69"           "F G63"          
## [142] "F G73"           "F2"              "F33"            
## [145] "F38"             "F4"              "G6"             
## [148] "T"

The cabin name mostly adheres to the rule of a single letter A-F,G,T, followed by a number up to 3 digits. There are cases where a passenger has multiple cabins, each separated by whitespace. The beginning letter of each cabin could denote a deck or particular region of the ship - which could help with predicting survivability. Alternatively, the number of the cabin could be more informative than the beginning letter. Perhaps cabins “A19” and “B19” are located right next to one another, for instance.

cabinLetter = ifelse(train$Cabin == "", NA, substr(train$Cabin,1,1));
cabinREs = gregexpr("\\d+",train$Cabin, perl=TRUE);
cnMatches = regmatches( train$Cabin, cabinREs);
cabinNumber = numeric(length(cnMatches));
for ( i in 1:length(cnMatches) ){
  cabinNumber[i] = mean( as.numeric( unlist( cnMatches[i] )));
}
edaTrain$CabinLetter <- as.factor(cabinLetter);
edaTrain$CabinNumber <- cabinNumber;
ggplot( edaTrain, aes(x=CabinNumber,y=Survived,color=Survived ) ) + 
  geom_point( shape=1,position=position_jitter(height=0.25)) +
  ggtitle("Survivability by Cabin Number") +
  xlab("Cabin Number") +
  ylab("Survived");

ggplot(subset(edaTrain, !is.na(cabinLetter)), aes(x = CabinLetter, fill = as.factor(Survived))) +
  geom_bar() +
  ggtitle("Survivability by Ticket Letter") +
  xlab("Cabin Letter") +
  ylab("Total Count") +
  labs(fill = "Survived");

When grouping the passengers by cabin number, there does not appear to be a relationship where survivability depends on cabin number. If so, there should be identifiable pockets of clusters where there is a higher incidence of survivability. If such clusters appeared to exist, the clusters could be defined and the group of clusters tested with a Chi-Square test to measure if survivability depends on cabin cluster.

It’s not immediately obvious if there is a benefit to categorizing the cabins according to their first letter. Are these groups statistically different from one another? A Chi-Square test of independence should show if survivability is dependent on cabin letter or not.

tbl = table( edaTrain$Survived, edaTrain$CabinLetter);
tbl
##    
##      A  B  C  D  E  F  G  T
##   0  8 12 24  8  8  5  2  1
##   1  7 35 35 25 24  8  2  0
chisq.test(tbl);
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 10.301, df = 7, p-value = 0.1722

The p-value is 0.17 so at a confidence level of 0.05, we cannot reject the null hypothesis that survivability is independent of the starting cabin letter.

edaTrain$CabinAssignment[ edaTrain$Cabin != "" ] <- "Assigned";
edaTrain$CabinAssignment[ edaTrain$Cabin == "" ] <- "Unassigned";
data_combined$CabinAssignment[ data_combined$Cabin != "" ] <- "Assigned";
data_combined$CabinAssignment[ data_combined$Cabin == "" ] <- "Unassigned";
data_combined$CabinAssignment = factor( data_combined$CabinAssignment );
ggplot(edaTrain, aes(x=CabinAssignment, fill=factor(Survived))) +
  geom_bar() + 
  facet_wrap(~Pclass) + 
  ggtitle("Pclass") + 
  xlab("Cabin Assignment") +
  ylab("Total Count") +
  labs(fill="Survived");

From the graph, it appears that for each passenger class, if a passenger is assigned a cabin, their chances of surviving the disaster are better than if they had not been assigned a cabin.

Embarked

Passengers boarded the Titanic from one of three ports: (S)outhampton, England; (C)herbourg, France, or (Q)ueenstown, Ireland. As noted, two passengers in the training set have no record of port of Embarkation.

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
62 62 1 1 Icard, Miss. Amelie female 38 0 0 113572 80 B28
830 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62 0 0 113572 80 B28

The two passengers are both female and in first class.

ggplot( edaTrain, aes(x=factor(Pclass),fill=factor(Survived)))+
  geom_bar() + 
  facet_wrap(~Embarked) +
  ggtitle("Survivability by Port of Embarkation and Passenger Class") +
  xlab("Passenger Class") +
  ylab("Count") +
  labs(fill="Survived");


Errors in the Data

Errors in the input dataset are often not apparent until deeper analyses are performed, such as feature engineering and imputing missing data. However, once the data is corrected, it is necessary to regenerate the columns and feature analyses that will feed the predictive models. In this dataset, it was found that a 16-year-old member of a family was incorrectly identified as being the father:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked CabinAssignment
87 87 0 3 Ford, Mr. William Neal male 16 1 3 W./C. 6608 34.375 S Unassigned
148 148 0 3 Ford, Miss. Robina Maggie “Ruby” female 9 2 2 W./C. 6608 34.375 S Unassigned
437 437 0 3 Ford, Miss. Doolina Margaret “Daisy” female 21 2 2 W./C. 6608 34.375 S Unassigned
737 737 0 3 Ford, Mrs. Edward (Margaret Ann Watson) female 48 1 3 W./C. 6608 34.375 S Unassigned
1059 1059 NA 3 Ford, Mr. Edward Watson male 18 2 2 W./C. 6608 34.375 S Unassigned

In the Ford family, Mr. William Neal Ford has the value ‘1’ for SibSp and ‘3’ for Parch. The value of ‘1’ for SibSp would imply that William has either one spouse or one sibling. In addition, William has a value of ‘3’ for Parch. This indicates that William has either 3 children, 1 parent and 2 children, 2 parents and 1 child, or 3 parents, which is clearly not possible according to the variable definition. Since two of the children in the family, Miss. Doolina Margaret Ford (21), and Mr. Edward Watson Ford (18) are actually older than William, it is clear that he cannot be their father. The other child, Miss. Robina Maggie Ford (9) is only 7 years younger than William, also indicating that William cannot be her father. The matriarch of the family is Mrs. Edward Ford - not Mrs. William Neal Ford, which indicates that it’s likely not a simple case of getting William Neal Ford’s age wrong (for instance, 46 instead of 16).

Therefore, the best fix would be to change the Parch variables for Miss. Robina Maggie Ford, Miss. Doolina Margaret Ford, and Mr. Edward Watson Ford from ‘2’ to ‘1’ (due to the inclusion of Mr. William Neal as a brother). In addition, the Parch variable of Mrs. Edward Ford would rise from ‘3’ to ‘4’, indicating she’s travelling with four of her children and the Parch variable of Mr. William Neal Ford would change from ‘3’ to ‘1’ indicating that he’s travelling with a single parent, his mother. The SibSp variables would change from ‘2’ to ‘3’ for Miss. Robina Maggie Ford, Miss. Doolina Margaret Ford, and Mr. Edward Watson Ford, indicating that each of them are travelling with 3 siblings. Similarly, Mr. William Neal Ford’s SibSp variable would change from ‘1’ to ‘3’. Finally, Mrs. Edward Ford’s SibSp would change from ‘1’ to ‘0’, indicating she was travelling without her husband (Mr. Edward Ford). The changes appear below.

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked CabinAssignment
87 87 0 3 Ford, Mr. William Neal male 16 3 1 W./C. 6608 34.375 S Unassigned
148 148 0 3 Ford, Miss. Robina Maggie “Ruby” female 9 3 1 W./C. 6608 34.375 S Unassigned
437 437 0 3 Ford, Miss. Doolina Margaret “Daisy” female 21 3 1 W./C. 6608 34.375 S Unassigned
737 737 0 3 Ford, Mrs. Edward (Margaret Ann Watson) female 48 1 4 W./C. 6608 34.375 S Unassigned
1059 1059 NA 3 Ford, Mr. Edward Watson male 18 3 1 W./C. 6608 34.375 S Unassigned

Additionally, there are problems with the Abbott family with ticket C.A., 2673:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked CabinAssignment
280 280 1 3 Abbott, Mrs. Stanton (Rosa Hunt) female 35 1 1 C.A. 2673 20.25 S Unassigned
747 747 0 3 Abbott, Mr. Rossmore Edward male 16 1 1 C.A. 2673 20.25 S Unassigned
1284 1284 NA 3 Abbott, Master. Eugene Joseph male 13 0 2 C.A. 2673 20.25 S Unassigned

Thirteen year old Master. Eugene Joseph Abbott has a Parch value of ‘2’, meaning he has two parents on board. A female on the same ticket, Mrs. Stanton Abbott, is 22 years older than Eugene at 35 years of age. Mrs. Stanton Abbott has a SibSp of ‘1’ and a Parch of ‘1’. The other person on the ticket is 16-year-old Mr. Rossmore Edward Abbott, with a SibSp of ‘1’ and a Parch of ‘1’. It appears that 16-year-old Mr. Rossmore Edward Abbott was incorrectly designated the spouse of Mrs. Stanton Abbott instead of her son. The corrections appear below.

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked CabinAssignment
280 280 1 3 Abbott, Mrs. Stanton (Rosa Hunt) female 35 0 2 C.A. 2673 20.25 S Unassigned
747 747 0 3 Abbott, Mr. Rossmore Edward male 16 1 1 C.A. 2673 20.25 S Unassigned
1284 1284 NA 3 Abbott, Master. Eugene Joseph male 13 1 1 C.A. 2673 20.25 S Unassigned

Feature Engineering

When importing into R, factor/class variables solely composed of numeric entries are interpreted and imported as numeric types, implying an ordering and a distance between entries. These variables include PassengerId, Survived, and Pclass. PassengerId is like a primary key for each passenger, so it’s not necessary to convert this variable into a factor. However, the Survived data is encoded as a binary flag where 1=survived and 0=not survived. Since our goal is to predict classes (i.e., survived or died), it is necessary to convert this value into a factor. The Pclass variable is a factor, but it is an ordered factor. An ordered factor indicates that there is an ordering between the classes (1st class, 2nd class, 3rd class), but no magnitude between each class level can be inferred.

train$Survived = as.factor(train$Survived);
train$Pclass = as.ordered(train$Pclass);

Imputing Missing Data

Embarked

Two passengers had no information for their port of embarcation, Miss. Amelie Icard and Mrs. George Nelson Stone. These first class passengers were both on the same ticket (113572) and stayed in Cabin B28. The majority of passengers boarded at Southampton (70%), as compared to Cherbourg (21%) and Queenstown (9%). There is quite a variety in the ticket numbers on the Titanic. Their ticket number, 113572, is one in a series of similar ticket numbers, all in the 113XXX format. There are 56 tickets in the 113XXX format and of the 54 that have a valid port of embarcation, 81% were from Southampton. Considering the different data points, the two missing Embarked values will be considered ‘S’ for Southampton.

data_combined[62,"Embarked"] = as.factor("S");
data_combined[830,"Embarked"] = as.factor("S");
data_combined$Embarked = factor( data_combined$Embarked );  # removes empty factor ""

Cabin

As determined in the EDA section, over 77% of the passengers have no recorded cabin information. In addition, the assigned cabins did not seem to have a predictable pattern from the available data. Further, cursory analyses did not indicate that cabin assignments were a good predictor of survivability. As such, the missing cabin data will not be imputed.

Fare

Out of 1309 records, one fare is missing:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked CabinAssignment
1044 1044 NA 3 Storey, Mr. Thomas male 60.5 0 0 3701 NA S Unassigned

The fare can be estimated by considering the variables that most likely affect the Fare variable, such as passenger class (Pclass), port of Embarcation (Embarked), Cabin, and how many other people are on the same ticket. The passenger class and port of embarkation are readily available in each data row. The Cabin variable, however, is not useful in this case since the Cabin variable is not defined for passenger 1044. However, the fact that no Cabin was recorded for this passenger is significant since perhaps not all passengers were assigned cabins (i.e., wealthy passengers would be assigned cabins, and hence pay higher fares, while less privileged passengers may have to had large, unassigned community accomodations with a comparatively lower fare).

ticketChar = as.character( data_combined$Ticket );
numPassengers = nrow( data_combined );
data_combined$NumPassengersOnTicket = 1;
for (i in 1:numPassengers ){
    thisTicket = as.character( data_combined[i,"Ticket"] );
    idxPeople = which( thisTicket == ticketChar );
    numPeople = length( idxPeople );
    data_combined$NumPassengersOnTicket[i] = numPeople;
}
# now constrain to those that have just 1 person on the ticket, as passenger 1044
singlePassengers = data_combined$NumPassengersOnTicket == 1;
embarkedS = data_combined$Embarked == 'S';
unassignedCabins = data_combined$CabinAssignment == 'Unassigned';
pClass3 = data_combined$Pclass == 3;
similarPassengers = which( singlePassengers & embarkedS & unassignedCabins & pClass3 );
similarPassengersFareData = data_combined[similarPassengers,];  # include all columns but the one we have no Fare info
similarPassengersFareData = similarPassengersFareData[ similarPassengersFareData$PassengerId != 1044,];  # include all columns but the one we have no Fare info
medianFare = median( similarPassengersFareData$Fare );
data_combined[1044,"Fare"] = medianFare;
hist(similarPassengersFareData$Fare, xlab = "Fare", ylab = "Passengers", main = "Fare of Similar Passengers");

summary( similarPassengersFareData$Fare );
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.171   7.775   7.896   8.057   8.050  19.967

There are 318 passengers that share the same characteristics as passenger 1044: passenger class 3, port of embarkation Southampton, cabin assignment (unassigned), and the number of people with the same ticket (1). Excluding a fare of 3.1708 and two at 19.9667, the remaining fares are between 6.2375 and 10.5167. Passenger 1044 is assigned the median fare of 7.896.

Age

As found above, the Age data is missing for 177 of the train set cases and 86 of the test set cases. Since Age is likely to be an important predictor of survival, the missing data should be imputed. As will be shown in the next section on Feature Engineering, the passenger’s title can be extracted from their name. The title should allow for a better estimate of a passenger’s age.

The missing passengers have the following titles:

##     Capt.      Col. Countess.      Don.     Dona.       Dr. Jonkheer. 
##         0         0         0         0         0         1         0 
##     Lady.    Major.   Master.     Miss.     Mlle.      Mme.       Mr. 
##         0         0         8        50         0         0       176 
##      Mrs.       Ms.      Rev.      Sir. 
##        28         0         0         0

The resultant linear model uses the following average ages for each missing passenger title:

# model is simple: use the passenger's title (i.e., Mr., Master., Mrs.) to determine age
# May want to better this by using age of parents (if travelling with parents), age of siblings, spouse, etc.
age_lm = lm( Age~AgeTitle, data=data_combined);

missingAges = predict( age_lm, data_combined[missingAgeIndices,]);
data_combined[missingAgeIndices,"FixedAge"] = missingAges;
coeffs = coefficients( age_lm );
coeffNames = names(coeffs);
coeffValues = unname(coeffs);
# find (Intercept)
idxIntercept = match( "(Intercept)", coeffNames );
interceptValue = coeffValues[idxIntercept];
allOtherCoeffIndices = setdiff(1:length(coeffNames),idxIntercept);
allOtherCoeffValues = coeffValues[allOtherCoeffIndices];
titleAges = interceptValue + allOtherCoeffValues;
names(titleAges) =  gsub( "AgeTitle", "", coeffNames[allOtherCoeffIndices] );
knitr::kable( bind_rows(titleAges), format="markdown", longtable=TRUE)
Col. Countess. Don. Dona. Dr. Jonkheer. Lady. Major. Master. Miss. Mlle. Mme. Mr. Mrs. Ms. Rev. Sir.
54 33 40 39 43.57143 38 48 48.5 5.482641 21.77424 24 24 32.25215 36.99412 28 41.25 49

New Features

Title and AgeTitle

The Name variable in the test and train datasets has some structure - surname followed by a comma, then a title and a given name. Additionally, a maiden name will be present in parentheses if the passenger is a married woman.

passenger_names = as.character(data_combined$Name);
head(passenger_names);
## [1] "Braund, Mr. Owen Harris"                            
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"                             
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
## [5] "Allen, Mr. William Henry"                           
## [6] "Moran, Mr. James"

The implied structure is [SURNAME] COMMA [TITLE] [GIVEN NAME] (MAIDEN NAME if applicable). Since the goal is to extract the title information from each row programmatically, it is necessary to enforce some error checking - namely, that each row contain one and only one comma. If this condition does not hold true, the title extraction task will be more tedious.

num_commas = unname(sapply( passenger_names, str_count, ","));
all_commas = assert_that( all( num_commas == 1 ) );  # Make sure each row has a comma
# Now, extract titles.  Split on comma and take the tail end of the character string
surnames = sapply( strsplit(as.character(passenger_names), ","), head, 1);
data_combined$Surname = as.factor( surnames );
given_name_string = sapply( strsplit(as.character(passenger_names), ","), tail, 1);
# Remove leading whitespace, if any
given_name_string = trimws( given_name_string, "left");
# To extract the title from the Name character string, split each string 
name_tokens = strsplit( given_name_string, "\\s+");
# Exract first token as the title
titles = lapply(name_tokens,"[[",1);
given_names = lapply(name_tokens,"[[",2);
data_combined$GivenName = as.factor( unlist( given_names ) );

The full set of extracted titles:

passenger_titles = unlist( titles );
table( passenger_titles );
## passenger_titles
##     Capt.      Col.      Don.     Dona.       Dr. Jonkheer.     Lady. 
##         1         4         1         1         8         1         1 
##    Major.   Master.     Miss.     Mlle.      Mme.       Mr.      Mrs. 
##         2        61       260         2         1       757       197 
##       Ms.      Rev.      Sir.       the 
##         2         8         1         1

Most of these look correct and reasonable as titles (“Don.” and “Dona.” are Italian honorifics, as is “Jonkheer.” a Dutch honorific). However, the title “the” is suspicious. The complete row is as follows:

data_combined[ which( data_combined$Title == "Countess." ), ]
##     PassengerId Survived Pclass
## 760         760        1      1
##                                                         Name    Sex Age
## 760 Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) female  33
##     SibSp Parch Ticket Fare Cabin Embarked CabinAssignment
## 760     0     0 110152 86.5   B77        S        Assigned
##     NumPassengersOnTicket FixedAge Surname
## 760                     3       33  Rothes
##                                                  NameString GivenName
## 760 the Countess. of (Lucy Noel Martha Dyer-Edwards) Rothes Countess.
##         Title FixedTitle  AgeTitle
## 760 Countess.       Mrs. Countess.

Although it makes little difference in this case since there is only a single “the Countess. of Rothes” on board, the title of this passenger should be changed from “the” to “Countess.”. This could be fixed by changing the single record, but the fact that this incorrect title occurred in the first place indicates that the methodology for finding titles isn’t robust enough. Since it appears that the only time a “.” appears in the Name field is at the end of a title, that information could be used to find all titles.

title_regs = regexpr( "[\\w+]+\\.", given_name_string, perl=TRUE );
re_titles = regmatches( given_name_string, title_regs );
data_combined$Title = as.factor( unlist( re_titles ));
summary( data_combined$Title );
##     Capt.      Col. Countess.      Don.     Dona.       Dr. Jonkheer. 
##         1         4         1         1         1         8         1 
##     Lady.    Major.   Master.     Miss.     Mlle.      Mme.       Mr. 
##         1         2        61       260         2         1       757 
##      Mrs.       Ms.      Rev.      Sir. 
##       197         2         8         1

Of particular interest are the titles that are underrepresented and how they might relate to the most abundant titles (Mr., Mrs., Miss., and Master.). For instance, the title “Ms.” has only two occurences and as such, has limited predictive power. “Ms.” generally refers to an adult woman, either married or unmarried and so is much closer to either a “Mrs.” or “Miss.” title than a “Mr.”, for instance. Additionally, the French designations for “Miss.” and “Mrs.” appear as Madamoiselle (“Mlle.”) and Madame (“Mme.”). These titles would likely provide better predictive power if they were switched to their English equivalents.

Additionally, many of the Age attributes are missing from the dataset and since it seems likely that Age would be a strong predictor of survivability. Title should be a fairly robust predictor for Age, so engineering a title feature that is reflective of Age could be beneficial (i.e., a Doctor with a missing Age value is much more likely to be an adult than a child and adults have a generally lower survival rate than children). The titles of the passengers missing Age information are as follows:

# The set of titles that have missing age values
ageNA = is.na( data_combined$Age );
titleNA = data_combined$Title[ageNA];
summary( titleNA );
##     Capt.      Col. Countess.      Don.     Dona.       Dr. Jonkheer. 
##         0         0         0         0         0         1         0 
##     Lady.    Major.   Master.     Miss.     Mlle.      Mme.       Mr. 
##         0         0         8        50         0         0       176 
##      Mrs.       Ms.      Rev.      Sir. 
##        27         1         0         0

For the purposes of estimating the Age variable for this group, the “Ms.” entry is better suited as a “Mrs.”. Although the “Dr.” title has a relatively small number of instances, it is likely enough from which to draw a reasonable age estimate.

# The set of titles that have missing age values
data_combined$AgeTitle = data_combined$Title;
data_combined$AgeTitle[ which( data_combined$Title == "Ms." & ageNA )] = as.factor( "Mrs." );
summary( data_combined$AgeTitle );
##     Capt.      Col. Countess.      Don.     Dona.       Dr. Jonkheer. 
##         1         4         1         1         1         8         1 
##     Lady.    Major.   Master.     Miss.     Mlle.      Mme.       Mr. 
##         1         2        61       260         2         1       757 
##      Mrs.       Ms.      Rev.      Sir. 
##       198         1         8         1

Parents and Children

Next, the Parch variable encodes two different measures: the number of parents a passenger has on board AND/OR the number of children a passenger has on board. Since this variable is overloaded, a more informative variable set might disambiguate these measures into NumParents and NumChildren (aboard). Such a set of variables may provide more predictive power than the single Parch variable.

numParents = integer( nrow( data_combined ) );
numChildren = integer( nrow( data_combined ) );

MAX_CHILD_AGE = 14;

# Find all Parch > 0
posParch = data_combined$Parch > 0;
idxParch = which( posParch );
counter = 0;
numCases = length( idxParch );
for (thisRow in idxParch){
  sibsp = data_combined[thisRow,"Sibsp"];
  # if sibsp > 1, then the passenger is travelling with siblings (and therefore, likely parents)
  counter = counter + 1;
  numParch = data_combined[thisRow,"Parch"];
  thisSurname = data_combined[thisRow,"Surname"];
  thisTicket = data_combined[thisRow,"Ticket"];
  thisTitle = as.character( data_combined[thisRow,"Title"] );
  thisAge = data_combined[thisRow,"Age"];
  if ( (thisTitle == "Master.") || (thisTitle == "Miss.") ){
    # if this passenger is a "Master." or Miss. and numParch <= 2, he/she must be someone's child
    numParents[thisRow] = numParch;
    #next;
  }
  if (!is.na( thisAge ) && thisAge <= MAX_CHILD_AGE ){
    # if this passenger is young, declare they cannot be parents, must be a child
    numParents[thisRow] = numParch;
    #next;
  }
  # get passenger rows on same ticket
  sameTickets = data_combined$Ticket == thisTicket;
  sameSurnames = data_combined$Surname == thisSurname;
  sameSurnameSameTicket = sameTickets & sameSurnames & posParch;
  ticketTitles = as.character( data_combined[ sameSurnameSameTicket, "Title"] );
  ticketParches = data_combined[ sameSurnameSameTicket, "Parch"];
  # now, look for passengers with the same surname on the ticket and check their titles and ages
  numSame = length( sameSurnameSameTicket );
  ages = sort( data_combined[sameSurnameSameTicket,"Age"] );
  thisAgePos = which( thisAge == ages )[1];  # find index of thisAge in sorted ages
  gaps = diff( ages );
  numGenerationalGaps = length( which( gaps > MAX_CHILD_AGE) );
  if ( numGenerationalGaps == 0 ){
    if ( !is.na( thisAge )){
      if ( ( thisAge >= 40 ) || (thisTitle == "Mrs." ) ){
        numChildren[thisRow] = numParch;
      }
      else if (numParch > 2 ){
        numChildren[thisRow] = numParch;
      }
      else{
        numParents[thisRow] = numParch;
      }
    }
    else{
      # no age information. If travelling with "kids", and title isn't a kid, then parent
      if ( ! ( thisTitle %in% c("Master.","Miss.") ) ){
        if( any( c("Master.","Miss.") %in% ticketTitles ) ){
          # now, if there are two Mr. in this group, we need to choose the real father
          # If there are three or more children, then Parch will be greater than 2 and
          # will indicate this is the father.  Else, it will be a child
          if( thisTitle == "Mr."){
            maxParches = max( ticketParches );
            if ( ( maxParches > 2 ) && ( maxParches == numParch ) ){
              numChildren[thisRow] = numParch;
            }
            else{
              numParents[thisRow] = numParch;
            }
          }
          else{
            numChildren[thisRow] = numParch;
          }
        }
        else{
          # all we have is a non Mr. or Miss. title.  Make them a child
            numChildren[thisRow] = numParch;
        }
      }
      else{
        numParents[thisRow] = numParch;
      }
    }
  }
  else{
    if ( !is.na(thisAge) ){  # use age in comparison to generation gap to classify kids/parents
      maxGapPos = which.max( gaps ) + 0.5;  # the 0.5 puts it in the middle of the kids/parents
      if ( thisAgePos < maxGapPos ){
        numParents[thisRow] = numParch;
      }
      else{
        numChildren[thisRow] = numParch;
      }
    }
    else{
      # no age info, have to go with titles
    }
  }
  totalParch = numParents[thisRow] + numChildren[thisRow];
  if ( totalParch != numParch ){
    stop( "Number of Parents/Children assigned (", totalParch, ") does not equal the Parch variable (", numParch, ") for passenger ", data_combined[thisRow,"PassengerId"], "\n");
  }
  data_combined$NumParents = numParents;
  data_combined$NumChildren = numChildren;
}

Siblings and Spouses

Similar to Parch, the SibSp variable indicates the number of Siblings AND/OR Spouses a passenger has on board. A better set of variables would be separate variables for Siblings and Spouses as it would be possible to model the original SibSp vector as a simple combination of the separate columns.

numSiblings = integer( nrow( data_combined ) );
numSpouses = integer( nrow( data_combined ) );

MAX_CHILD_AGE = 14;

posSibSp = data_combined$SibSp > 0;
idxSibSp = which( posSibSp );
counter = 0;
numCases = length( idxSibSp );
for (thisRow in idxSibSp){
  counter = counter + 1;
  numSibSp = data_combined[thisRow,"SibSp"];
  numParch = data_combined[thisRow,"Parch"];
  thisSurname = data_combined[thisRow,"Surname"];
  thisTicket = data_combined[thisRow,"Ticket"];
  thisTitle = as.character( data_combined[thisRow,"Title"] );
  thisAge = data_combined[thisRow,"Age"];
  thisSex = as.character( data_combined[thisRow,"Sex"] );
  thisGivenName = as.character( data_combined[thisRow,"GivenName"] );
  thisPassengerId = data_combined[thisRow,"PassengerId"];
  if ( (thisTitle == "Master.") || (thisTitle == "Miss.") ){
    # if this passenger is a "Master." or Miss., this passenger is not married, so must be spouse
    numSiblings[thisRow] = numSibSp;
    next;
  }
  if (!is.na( thisAge ) && thisAge <= 10 ){
    # if this passenger is young, declare they cannot be parents, must be a child
    numSiblings[thisRow] = numSibSp;
    next;
  }
  # get passenger rows on same ticket
  sameTickets = data_combined$Ticket == thisTicket;
  sameSurnames = data_combined$Surname == thisSurname;
  sameSurnameSameTicket = sameTickets & sameSurnames & posSibSp;
  # if married, look for spouse on the same ticket
  # get given names on same ticket
  same_ticket_rows = data_combined[which(sameSurnameSameTicket),];
  given_names = as.character( same_ticket_rows$GivenName );
  titles = as.character( same_ticket_rows$Title );
  sexes = as.character( same_ticket_rows$Sex );
  passengerIds = same_ticket_rows$PassengerId;
  master_mask = ( titles != "Master." );  
  sex_mask = ( sexes != thisSex );  # opposite sex marriage
  miss_mask = ( titles != "Miss." );
  this_mask = ( passengerIds != thisPassengerId );
  idxMatch = which( ( thisGivenName == given_names ) & master_mask & sex_mask & miss_mask & this_mask);
  if ( length( idxMatch ) == 1 ){
    numSpouses[thisRow] = 1;
    next;
  }
  ticketTitles = as.character( data_combined[ sameSurnameSameTicket, "Title"] );
  ticketParches = data_combined[ sameSurnameSameTicket, "Parch"];
  # now, look for passengers with the same surname on the ticket and check their titles and ages
  numSame = length( sameSurnameSameTicket );
  ages = sort( data_combined[sameSurnameSameTicket,"Age"] );
  thisAgePos = which( thisAge == ages )[1];  # find index of thisAge in sorted ages
  gaps = diff( ages );
  numGenerationalGaps = length( which( gaps > MAX_CHILD_AGE) );
}

# do some manual corrections
numSiblings[168] = 0;  # Mrs. William Skoog
numSpouses[168] = 1;  # Mrs. William Skoog
numSiblings[361] = 0;  # Mr. Wilhelm Skoog
numSpouses[361] = 1;  # Mr. Wilhelm Skoog
numSiblings[679] = 0;  # Mrs. Frederick Goodwin
numSpouses[679] = 1;  # Mrs. Frederick Goodwin
numSiblings[1031] = 0;  # Mr. Charles Frederick Goodwin
numSpouses[1031] = 1;  # Mr. Charles Frederick Goodwin
numSiblings[557] = 0;  # Lady. Duff Gordon
numSpouses[557] = 1;  # Lady. Duff Gordon
numSiblings[600] = 0;  # Sir. Duff Gordon
numSpouses[600] = 1;  # Sir. Duff Gordon
numSiblings[746] = 0;  # Capt. Edward Gifford Crosby
numSpouses[746] = 1;  # Capt. Edward Gifford Crosby
numSiblings[1197] = 0;  # Mrs. Edward Gifford Crosby
numSpouses[1197] = 1;  # Mrs. Edward Gifford Crosby
numSiblings[1059] = 3;  # Mr. Edward Watson Ford
numSpouses[1059] = 0;  # Mr. Edward Watson Ford

counter = 0;
for (thisRow in idxSibSp){
  counter = counter + 1;
  numSibSp = data_combined[thisRow,"SibSp"];
  numParch = data_combined[thisRow,"Parch"];
  thisSurname = data_combined[thisRow,"Surname"];
  thisTicket = data_combined[thisRow,"Ticket"];
  thisTitle = as.character( data_combined[thisRow,"Title"] );
  thisAge = data_combined[thisRow,"Age"];
  thisSex = as.character( data_combined[thisRow,"Sex"] );
  thisGivenName = as.character( data_combined[thisRow,"GivenName"] );
  thisPassengerId = data_combined[thisRow,"PassengerId"];
  numSiblings[thisRow] = numSibSp - numSpouses[thisRow];
}

data_combined$Spouses = numSpouses;
data_combined$SpousesFactor = factor( numSpouses );
data_combined$Siblings = numSiblings;

FarePerPassenger

As determined in the EDA section, the Fare variable is not per passenger, it is the price paid for the ticket the passenger is traveling under (and multiple passengers may travel on the same ticket).

ticketChar = as.character( data_combined$Ticket );
uniqueTickets = unique( ticketChar );
farePerPassenger = data_combined$Fare;
for (i in 1:length(uniqueTickets) ){
    thisTicket = uniqueTickets[i];
    idxPeople = which( ticketChar == thisTicket );
    theseFares = data_combined[idxPeople,"Fare"];
    #if ( !all( theseFares == theseFares[1] ) ){
    #    cat( thisTicket )
    #}
    # should only be a single fare
    
    numPeople = length( idxPeople );
    farePerPassenger[idxPeople] = theseFares[1]/numPeople;
}
data_combined$FarePerPassenger = farePerPassenger;

Models

This is a classification problem with two classes: { Died, Survived }. This is encoded in the ‘Survived’ variable in the train dataset.

Null Model

## Of  891  passengers,  549  died and  342  survived.
## Mean Survivability =  0.3838384

The simplest model would be to assume that all test subjects are members of the most common class in the train dataset. In the train dataset, 62% of the subjects died, so this simple model would assume that all the test passengers die. This would yield a prediction accuracy of about 62%, and a corresponding misclassification rate of 38%, all false negatives.

This simple model is also the fastest to implement and serves as a good initial benchmark from which to improve upon.

Upon submission to the Kaggle site, the model yielded a score of 0.62679. In terms of the bias/variance tradeoff, this is a very biased model (very inflexible) and the variance will likely be very small (i.e., the choice of training set will not affect predictions on the test set except in rare cases where more survivors are selected than non-survivors).

Gender Submission Model

Along with the train.csv and test.csv file, Kaggle provides an additional file called ‘gender_submission.csv’. This file is a sample submission to Kaggle in which all the female passengers are predicted to survive while all of the male passengers are predicted to die.

In the training set, of the 891 passengers, 342 are female. 233 of the 342 female passengers in the training set survived (74.2%). Of the 549 male passengers in the train set, 468 did not survive (81.1%). This yields an overall classification rate of 78.7% for the test set. On the bias/variance continuum, this model is very biased and should have small variance.

Upon submission to Kaggle, this model scored 0.76555.

Women and Children First Model

The canonical model for a sinking ship is that women and children get preferential access to lifeboats while men are expected to defer. If 100% accordance to this model is assumed and all women and children are boarded onto lifeboats while the men are left to fight it out for flotsam in the frigid waters, women and children would be expected to survive while the men would perish.

Since the Age variable is missing for 177 of the train set cases and 86 test set cases, the imputed Age values from above will be used. For this application, a child will be defined as a person under 15 years of age.

## Warning: package 'caret' was built under R version 3.4.2

In the training set, the proportion of correct predictions is 0.79.

Upon submission to Kaggle, the Women and Children First model scored 0.76076.

Linear Discriminant Analysis (LDA) Models

In Linear Discriminant Analysis, the distributions of each of the predictors (Passenger Class, Age, Sex, etc.) are modeled separately for the Survived and Died classes. Then, Bayes’ theorem can be flipped around to get estimates for the probability of Survived or Died given the set of predictors. LDA assumes that the observations in each class (Survived, Died) are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix common to both classes. This assumption does not hold for some of the predictors in this analysis, as not only are they not all normal, some are categorical (Sex, Embarked). Nevertheless, the implementation in R converts the variables to a numeric albeit not-normal representation.

LDA Model using all features

The first LDA model is one in which all predictors are considered: pClass, Sex, FixedAge, FarePerPassenger, FixedTitle, CabinAssignment, Embarked, NumPassengersOnTicket, NumParents, NumChildren, SpousesFactor, and Siblings. Against the training data, the model scored 0.8373 (~84% accuracy). When run against the test set, the model scored 0.78947.

library(MASS, quietly=TRUE);
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
xTrain = whiteTrain[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
xTest = whiteTest[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
lda.fit = lda( SurvivedFactor ~., data=xTrain);
lda.fit
## Call:
## lda(SurvivedFactor ~ ., data = xTrain)
## 
## Prior probabilities of groups:
##         0         1 
## 0.6161616 0.3838384 
## 
## Group means:
##     pclass2   pclass3   Sexmale   FixedAge FarePerPassenger
## 0 0.1766849 0.6775956 0.8524590  0.0586912       -0.2427148
## 1 0.2543860 0.3479532 0.3187135 -0.1214275        0.3509019
##   FixedTitleMiss. FixedTitleMr. FixedTitleMrs. CabinAssignmentUnassigned
## 0       0.1001821     0.7941712     0.07468124                 0.8761384
## 1       0.3713450     0.2368421     0.32456140                 0.6023392
##   EmbarkedQ EmbarkedS NumPassengersOnTicket  NumParents   NumChildren
## 0 0.0856102 0.7777778            -0.1662175 -0.04990379 -0.0429864357
## 1 0.0877193 0.6403509             0.2942616  0.16516206 -0.0002207613
##   SpousesFactor1    Siblings
## 0      0.1220401  0.09339948
## 1      0.2076023 -0.07771321
## 
## Coefficients of linear discriminants:
##                                   LD1
## pclass2                   -0.05937873
## pclass3                   -0.63312860
## Sexmale                   -2.58719687
## FixedAge                  -0.17944828
## FarePerPassenger           0.14185791
## FixedTitleMiss.           -2.59463695
## FixedTitleMr.             -2.09196720
## FixedTitleMrs.            -2.12049380
## CabinAssignmentUnassigned -0.51423840
## EmbarkedQ                 -0.02540707
## EmbarkedS                 -0.24815654
## NumPassengersOnTicket      0.06028440
## NumParents                -0.02647490
## NumChildren               -0.19587549
## SpousesFactor1            -0.20253519
## Siblings                  -0.33913849
plot( lda.fit );

lda.train.pred = predict( lda.fit, xTrain );  
ldaTrainPredictions = as.numeric( lda.train.pred$class ) - 1;
truth = newTrain$Survived;
confusionMatrix( ldaTrainPredictions, truth );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 488  84
##          1  61 258
##                                           
##                Accuracy : 0.8373          
##                  95% CI : (0.8114, 0.8609)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6515          
##  Mcnemar's Test P-Value : 0.0677          
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.7544          
##          Pos Pred Value : 0.8531          
##          Neg Pred Value : 0.8088          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5477          
##    Detection Prevalence : 0.6420          
##       Balanced Accuracy : 0.8216          
##                                           
##        'Positive' Class : 0               
## 
# scored a 0.8373 against the training data
lda.test.pred = predict( lda.fit, xTest );  
# lda.pred$class is a factor; need to convert to 0, 1
ldaPredictions = as.numeric( lda.test.pred$class ) - 1;
ldaModelTrain = data.frame(PassengerId=train$PassengerId, Survived=ldaTrainPredictions);
ldaModel = data.frame(PassengerId=test$PassengerId, Survived=ldaPredictions);

write.csv( ldaModelTrain, "LDAModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( ldaModelTrain, "LDAModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( ldaModel, "LDAModel.csv", row.names=FALSE, quote=FALSE );
write.csv( ldaModel, "LDAModelProbs.csv", row.names=FALSE, quote=FALSE );
# scored a 0.78947

LDA Model with only significant features

In subsequent analyses (logistic regression, random forests), the predictors are measured for effect on the response variable. The logistic regression analysis indicates which covariates have a low p-value (i.e., <0.05), indicating a significant effect. Similarly, random forests produce a variable importance measure which shows which covariates are used most often in tree generation. These measures give a good idea of which variables are likely to have the most predictive power. Using these variables and leaving out the variables with higher p-values and those not often used in random forests will likely reduce overfitting. The variables with the best predictive power in this set are pClass, Sex, FixedAge, FarePerPassenger, FixedTitle, CabinAssignment, NumChildren, SpousesFactor, and Siblings.

The LDA model using only significant features scored a 0.8339 against the training data and, similar to the full-featured LDA model, scored a 0.78947.

library(MASS);
xTrain = whiteTrain[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","NumChildren","SpousesFactor","Siblings")];
xTest = whiteTest[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","NumChildren","SpousesFactor","Siblings")];
lda.fit = lda( SurvivedFactor ~., data=xTrain);
plot( lda.fit );

lda.train.pred = predict( lda.fit, xTrain );  
ldaTrainPredictions = as.numeric( lda.train.pred$class ) - 1;
truth = newTrain$Survived;
confusionMatrix( ldaTrainPredictions, truth );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 490  89
##          1  59 253
##                                           
##                Accuracy : 0.8339          
##                  95% CI : (0.8078, 0.8578)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6429          
##  Mcnemar's Test P-Value : 0.01714         
##                                           
##             Sensitivity : 0.8925          
##             Specificity : 0.7398          
##          Pos Pred Value : 0.8463          
##          Neg Pred Value : 0.8109          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5499          
##    Detection Prevalence : 0.6498          
##       Balanced Accuracy : 0.8161          
##                                           
##        'Positive' Class : 0               
## 
# scored a 0.8339 against the training data
lda.test.pred = predict( lda.fit, xTest );  
# lda.pred$class is a factor; need to convert to 0, 1
ldaPredictions = as.numeric( lda.test.pred$class ) - 1;
ldaModelTrain = data.frame(PassengerId=train$PassengerId, Survived=ldaTrainPredictions);
ldaModel = data.frame(PassengerId=test$PassengerId, Survived=ldaPredictions);

write.csv( ldaModelTrain, "LDAReducedModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( ldaModelTrain, "LDAReducedModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( ldaModel, "LDAReducedModel.csv", row.names=FALSE, quote=FALSE );
write.csv( ldaModel, "LDAReducedModelProbs.csv", row.names=FALSE, quote=FALSE );
# scored a 0.78947

Quadratic Discriminant Analysis (QDA) Models

Quadratic Discriminant Analysis (QDA) is very much similar to LDA but with one difference: there is no assumption that the classes have a common variance. In this model, each class has its own covariance matrix. QDA is so named because the discriminant functions are quadratic in the predictor vector, as opposed to linear with LDA. Since the discriminant function is quadratic, the decision boundary is not constrained to a line as with LDA and can assume curved shapes.

QDA Model with all features

The QDA model with all variables scored a 0.8159 against the training set and 0.77033 against the test data.

library(MASS);
xTrain = whiteTrain[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
xTest = whiteTest[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
qda.fit = qda( SurvivedFactor ~., data=xTrain);
qda.train.pred = predict( qda.fit, xTrain );  
qdaTrainPredictions = as.numeric( qda.train.pred$class ) - 1;
truth = newTrain$Survived;
confusionMatrix( qdaTrainPredictions, truth );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 475  90
##          1  74 252
##                                           
##                Accuracy : 0.8159          
##                  95% CI : (0.7889, 0.8409)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6074          
##  Mcnemar's Test P-Value : 0.2415          
##                                           
##             Sensitivity : 0.8652          
##             Specificity : 0.7368          
##          Pos Pred Value : 0.8407          
##          Neg Pred Value : 0.7730          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5331          
##    Detection Prevalence : 0.6341          
##       Balanced Accuracy : 0.8010          
##                                           
##        'Positive' Class : 0               
## 
# scored a 0.8159 against the training data
qda.test.pred = predict( qda.fit, xTest );  
# lda.pred$class is a factor; need to convert to 0, 1
qdaPredictions = as.numeric( qda.test.pred$class ) - 1;
qdaModelTrain = data.frame(PassengerId=train$PassengerId, Survived=qdaTrainPredictions);
qdaModel = data.frame(PassengerId=test$PassengerId, Survived=qdaPredictions);

write.csv( qdaModelTrain, "QDAModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( qdaModelTrain, "QDAModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( qdaModel, "QDAModel.csv", row.names=FALSE, quote=FALSE );
write.csv( qdaModel, "QDAModelProbs.csv", row.names=FALSE, quote=FALSE );
# scored a 0.77033

QDA Model with only significant features

The QDA model using only significant features scored a 0.7845 against the training set and 0.78947 against the test data.

library(MASS);
xTrain = whiteTrain[,c("SurvivedFactor","pclass","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","NumChildren","Siblings")];
xTest = whiteTest[,c("SurvivedFactor","pclass","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","NumChildren","Siblings")];
qda.fit = qda( SurvivedFactor ~., data=xTrain);
qda.train.pred = predict( qda.fit, xTrain );  
qdaTrainPredictions = as.numeric( qda.train.pred$class ) - 1;
truth = newTrain$Survived;
confusionMatrix( qdaTrainPredictions, truth );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 444  87
##          1 105 255
##                                          
##                Accuracy : 0.7845         
##                  95% CI : (0.756, 0.8111)
##     No Information Rate : 0.6162         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.5489         
##  Mcnemar's Test P-Value : 0.2199         
##                                          
##             Sensitivity : 0.8087         
##             Specificity : 0.7456         
##          Pos Pred Value : 0.8362         
##          Neg Pred Value : 0.7083         
##              Prevalence : 0.6162         
##          Detection Rate : 0.4983         
##    Detection Prevalence : 0.5960         
##       Balanced Accuracy : 0.7772         
##                                          
##        'Positive' Class : 0              
## 
# scored a 0.7845 against the training data
qda.test.pred = predict( qda.fit, xTest );  
# lda.pred$class is a factor; need to convert to 0, 1
qdaPredictions = as.numeric( qda.test.pred$class ) - 1;
qdaModelTrain = data.frame(PassengerId=train$PassengerId, Survived=qdaTrainPredictions);
qdaModel = data.frame(PassengerId=test$PassengerId, Survived=qdaPredictions);

write.csv( qdaModelTrain, "QDASigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( qdaModelTrain, "QDASigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( qdaModel, "QDASigModel.csv", row.names=FALSE, quote=FALSE );
write.csv( qdaModel, "QDASigModelProbs.csv", row.names=FALSE, quote=FALSE );
# scored a 0.78947

Logistic Regression Models

In a linear regression model, the probability of the response variable is modeled as a linear combination of predictors and multiplicative coefficients. The encoding of the response variable (Survived) as 0 and 1 allows for a linear regression model but covariates that produce a response greater than 1 and less than 0 are considered errors and can negatively impact parameter estimation. In logistic regression, the probability of the response variable is modeled with the logistic function, or y = eX/(1+eX). Logistic regression is a classic model for binary classification.

Logistic Regression using all features

A logistic regression model is fit using the train data. Selected variables are shown below. Some variables could be treated as either numeric or factors (i.e., NumPassengersOnTicket and NumChildren). Because there was a natural ordering of these variables, they are treated as numeric values rather than factors. This was done to capture the linear dependence of each variable to the survival response.

all_features = c("pclass", "Sex", "FixedAge", "FarePerPassenger", "FixedTitle", "Siblings", "CabinAssignment", "Embarked", "NumPassengersOnTicket", "NumParents", "NumChildren", "SpousesFactor");
full_df = newTrain[, c("Survived", all_features)];
full.glm.fit = glm( Survived~., data=full_df, family=binomial);
summary( full.glm.fit );
## 
## Call:
## glm(formula = Survived ~ ., family = binomial, data = full_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4086  -0.5641  -0.3761   0.5328   2.6498  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                7.50564    1.06421   7.053 1.75e-12 ***
## pclass2                   -0.28984    0.47212  -0.614  0.53928    
## pclass3                   -1.25993    0.49740  -2.533  0.01131 *  
## Sexmale                   -3.60602    0.65073  -5.541 3.00e-08 ***
## FixedAge                  -0.02796    0.01022  -2.736  0.00622 ** 
## FarePerPassenger           0.01514    0.01310   1.156  0.24775    
## FixedTitleMiss.           -4.08170    0.84886  -4.808 1.52e-06 ***
## FixedTitleMr.             -3.34786    0.58437  -5.729 1.01e-08 ***
## FixedTitleMrs.            -3.20474    0.82194  -3.899 9.66e-05 ***
## Siblings                  -0.71829    0.17065  -4.209 2.56e-05 ***
## CabinAssignmentUnassigned -0.80448    0.36274  -2.218  0.02657 *  
## EmbarkedQ                 -0.06568    0.40009  -0.164  0.86961    
## EmbarkedS                 -0.41933    0.25383  -1.652  0.09852 .  
## NumPassengersOnTicket      0.10632    0.09815   1.083  0.27875    
## NumParents                -0.23028    0.26983  -0.853  0.39341    
## NumChildren               -0.48163    0.18715  -2.574  0.01007 *  
## SpousesFactor1            -0.39542    0.30922  -1.279  0.20097    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  714.88  on 874  degrees of freedom
## AIC: 748.88
## 
## Number of Fisher Scoring iterations: 5
trainSetProbs = predict( full.glm.fit, newTrain, type="response");  # now a vector of probabilities
# probabilities >= 0.5 mean survived, < 0.5 mean perished
# training set performance
threshold = 0.5;
predictions = as.numeric( trainSetProbs >= threshold );
confusionMatrix( predictions, newTrain$Survived );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 485  84
##          1  64 258
##                                           
##                Accuracy : 0.8339          
##                  95% CI : (0.8078, 0.8578)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6449          
##  Mcnemar's Test P-Value : 0.1183          
##                                           
##             Sensitivity : 0.8834          
##             Specificity : 0.7544          
##          Pos Pred Value : 0.8524          
##          Neg Pred Value : 0.8012          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5443          
##    Detection Prevalence : 0.6386          
##       Balanced Accuracy : 0.8189          
##                                           
##        'Positive' Class : 0               
## 
# now, how'd we do against the test set?
testSetProbs = predict( full.glm.fit, newTest, type="response");
fullLogisticPredictions = as.numeric( testSetProbs >= threshold );
fullLogisticRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=predictions);
fullLogisticRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=fullLogisticPredictions);

fullLogisticRegressionModelTrainProbs = data.frame(PassengerId=train$PassengerId, Survived=trainSetProbs);
fullLogisticRegressionModelProbs = data.frame(PassengerId=test$PassengerId, Survived=testSetProbs);

write.csv( fullLogisticRegressionModelTrainProbs, "LogisticRegressionFullModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullLogisticRegressionModelProbs, "LogisticRegressionFullModelProbs.csv", row.names=FALSE, quote=FALSE );

write.csv( fullLogisticRegressionModelTrain, "LogisticRegressionFullModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( fullLogisticRegressionModel, "LogisticRegressionFullModel.csv", row.names=FALSE, quote=FALSE );
# scored a 0.78947 - not expected

The logistic regression model was first fit with all 12 possible covariates. After fitting the model to the training data, the model was run on the training set yielding a training set test error of 0.1661055 (0.8338945 accuracy rate). The model had an overall accuracy of 0.78947 on the test set (the score from Kaggle). Since this was lower than the expected score (the training set yielded a score of 0.8338945), it appears the model may be overfitting the training data.

library(caret);
library(lattice);
cat( "Training Set performance:\n");
## Training Set performance:
confusionMatrix( predictions, newTrain$Survived);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 485  84
##          1  64 258
##                                           
##                Accuracy : 0.8339          
##                  95% CI : (0.8078, 0.8578)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6449          
##  Mcnemar's Test P-Value : 0.1183          
##                                           
##             Sensitivity : 0.8834          
##             Specificity : 0.7544          
##          Pos Pred Value : 0.8524          
##          Neg Pred Value : 0.8012          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5443          
##    Detection Prevalence : 0.6386          
##       Balanced Accuracy : 0.8189          
##                                           
##        'Positive' Class : 0               
## 

Because the logistic regression model with all 12 predictors has a test set error somewhat larger than the training set error, the model is likely fitting noise in the training set. It’s very likely that some of the features are co-linear, for instance, Passenger Class (pclass) and FarePerPassenger. If the number of predictors can be reduced, the variance of the model can likely be lowered. After running the full set of predictors, the model shows several variables as significant: FixedTitle, Sex, and Siblings. Other variables that show nearly significant correlation to the response are FixedAge, pclass, CabinAssignment, and NumChildren. If those variables are brought forward and tested separately in different combinations, a model with a lower cross validation error might result. As a final measure, a separate model was run with the remaining “leftover” features: FarePerPassenger, Embarked, NumPassengersOnTicket, SpousesFactor, and NumParents. The idea is to see if any significant features can be determined in the absence of the significant features of the full model.

leftoverFeatures = c("FarePerPassenger","Embarked","NumPassengersOnTicket","SpousesFactor","NumParents");
leftover_df = newTrain[, c("Survived", leftoverFeatures)];
leftover.glm.fit = glm( Survived ~., data = leftover_df, family=binomial);
summary(leftover.glm.fit);
## 
## Call:
## glm(formula = Survived ~ ., family = binomial, data = leftover_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6601  -0.8987  -0.7557   1.1841   1.9004  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -0.955763   0.234702  -4.072 4.66e-05 ***
## FarePerPassenger       0.048603   0.007031   6.912 4.77e-12 ***
## EmbarkedQ              0.095578   0.301596   0.317 0.751314    
## EmbarkedS             -0.483531   0.194625  -2.484 0.012976 *  
## NumPassengersOnTicket -0.046748   0.053266  -0.878 0.380145    
## SpousesFactor1         0.520507   0.213264   2.441 0.014660 *  
## NumParents             0.594420   0.173922   3.418 0.000631 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance: 1080.1  on 884  degrees of freedom
## AIC: 1094.1
## 
## Number of Fisher Scoring iterations: 4

Logistic Regression using only significant features

The logistic regression model with the “leftover” features yielded two new feature possibilities: FarePerPassenger and NumParents. These features are likely proxies for passenger class (pclass) and age (FixedAge). i.e., passengers travelling with their parents are more likely to be children and thus, have a higher survival probability. Similarly, FarePerPassenger is likely to reflect the passenger class separation. However, unlike pclass, FarePerPassenger is a continuous variable and possibly more informative than the three passenger classes. So, combined with the significant features from the full logistic regression model, a set of features can be drawn upon in different combinations in order to select the model with the lowest cross validation error.

library(boot, quietly = TRUE);
## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma
significant_features = c("FixedTitle", "Sex", "Siblings", "FixedAge", "CabinAssignment", "pclass", "NumChildren", "FarePerPassenger", "Embarked");
all_combinations_of_significant_features = lapply(1:length(significant_features), function(x) combn(length(significant_features),x));
numFeatureSetClasses = length( all_combinations_of_significant_features );
correct = list();
accuracies = list();
errors = list();
features = list();
numFeatures = list();
counter = 1;
for ( featureClass in 1:numFeatureSetClasses){
  featClassColumns = dim( all_combinations_of_significant_features[[featureClass]] )[2];
  for ( thisCol in 1:featClassColumns ){
    feature_set = significant_features[ all_combinations_of_significant_features[[featureClass]][,thisCol]];
    nFeatures = length( feature_set );
    df = newTrain[, c("Survived", feature_set)]
    glm.fit = glm( Survived~.,data=df,family=binomial);
    cv.glm.fit = cv.glm(df,glm.fit,K=10);
    cv.error = cv.glm.fit$delta[1];
    trainSetProbs = predict( glm.fit, newTrain, type="response");  # now a vector of probabilities
    predictions = as.numeric( trainSetProbs >= threshold );
    numRight = sum( predictions == newTrain$Survived );
    accuracy = mean( predictions == newTrain$Survived );
    features[counter] = list( feature_set );
    numFeatures[counter] = nFeatures;
    accuracies[counter] = accuracy;
    correct[counter] = numRight;
    errors[counter] = cv.error;
    counter = counter + 1;
  }
}
cvResults = data.frame( NumFeatures=unlist(numFeatures), CVError=unlist(errors), TrainingScore=unlist(accuracies), NumCorrect=unlist(correct), Features=I(features) );
cvResults = cvResults[order(cvResults$CVError),];
knitr::kable( head(cvResults,n=20L), format="markdown", longtable=TRUE)
NumFeatures CVError TrainingScore NumCorrect Features
466 7 0.1278208 0.8305275 740 FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, NumChildren
503 8 0.1284761 0.8395062 748 FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, NumChildren, Embarked
502 8 0.1284869 0.8282828 738 FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, NumChildren, FarePerPassenger
473 7 0.1287440 0.8350168 744 FixedTitle, Sex, Siblings, FixedAge, pclass, NumChildren, Embarked
467 7 0.1288112 0.8260382 736 FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger
511 9 0.1291203 0.8350168 744 FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, NumChildren, FarePerPassenger, Embarked
468 7 0.1292210 0.8338945 743 FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, Embarked
382 6 0.1294414 0.8271605 737 FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass
386 6 0.1294508 0.8249158 735 FixedTitle, Sex, Siblings, FixedAge, pclass, NumChildren
257 5 0.1297716 0.8249158 735 FixedTitle, Sex, Siblings, FixedAge, pclass
388 6 0.1299351 0.8271605 737 FixedTitle, Sex, Siblings, FixedAge, pclass, Embarked
506 8 0.1300108 0.8327722 742 FixedTitle, Sex, Siblings, FixedAge, pclass, NumChildren, FarePerPassenger, Embarked
507 8 0.1300126 0.8260382 736 FixedTitle, Sex, Siblings, CabinAssignment, pclass, NumChildren, FarePerPassenger, Embarked
392 6 0.1300478 0.8282828 738 FixedTitle, Sex, Siblings, CabinAssignment, pclass, NumChildren
477 7 0.1300765 0.8316498 741 FixedTitle, Sex, Siblings, CabinAssignment, pclass, NumChildren, Embarked
504 8 0.1301593 0.8294052 739 FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger, Embarked
474 7 0.1303197 0.8271605 737 FixedTitle, Sex, Siblings, FixedAge, pclass, FarePerPassenger, Embarked
476 7 0.1306408 0.8249158 735 FixedTitle, Sex, Siblings, CabinAssignment, pclass, NumChildren, FarePerPassenger
472 7 0.1309137 0.8204265 731 FixedTitle, Sex, Siblings, FixedAge, pclass, NumChildren, FarePerPassenger
265 5 0.1310824 0.8282828 738 FixedTitle, Sex, Siblings, pclass, NumChildren

Of the 512 different feature combinations, the top 20 are listed. The cross validation error estimates were made using 10-fold cross validation. The top 10 are retested using LOOCV:

correct = list();
accuracies = list();
errors = list();
features = list();
numFeatures = list();
counter = 1;
for ( rowNum in 502:511){
    feature_set = cvResults$Features[[rowNum]];
    nFeatures = cvResults$NumFeatures[rowNum];
    df = newTrain[, c("Survived", feature_set)]
    glm.fit = glm( Survived~.,data=df,family=binomial);
    cv.glm.fit = cv.glm(df,glm.fit);
    cv.error = cv.glm.fit$delta[1];
    trainSetProbs = predict( glm.fit, newTrain, type="response");  # now a vector of probabilities
    predictions = as.numeric( trainSetProbs >= threshold );
    numRight = sum( predictions == newTrain$Survived );
    accuracy = mean( predictions == newTrain$Survived );
    features[counter] = list( feature_set );
    numFeatures[counter] = nFeatures;
    accuracies[counter] = accuracy;
    correct[counter] = numRight;
    errors[counter] = cv.error;
    counter = counter + 1;
}
loocvResults = data.frame( NumFeatures=unlist(numFeatures), CVError=unlist(errors), TrainingScore=unlist(accuracies), NumCorrect=unlist(correct), Features=I(features) );
loocvResults = loocvResults[order(loocvResults$CVError),];
knitr::kable( head(loocvResults,n=10L), format="markdown", longtable=TRUE)
NumFeatures CVError TrainingScore NumCorrect Features
2 1 0.2312228 0.6363636 567 Embarked
3 3 0.2312587 0.6318743 563 Siblings, NumChildren, Embarked
1 2 0.2316473 0.6363636 567 NumChildren, Embarked
4 2 0.2327469 0.6285073 560 Siblings, FixedAge
5 3 0.2329559 0.6262626 558 Siblings, FixedAge, NumChildren
7 1 0.2356697 0.6161616 549 FixedAge
6 2 0.2358816 0.6150393 548 FixedAge, NumChildren
8 1 0.2361338 0.6161616 549 Siblings
10 2 0.2366439 0.6161616 549 Siblings, NumChildren
9 1 0.2374988 0.6161616 549 NumChildren

The model with the lowest cross validation error is a model with 8 features: FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, NumChildren, and Embarked. A logistic regression model is fit using these features and submitted to Kaggle.

best_features = c("pclass", "Sex", "FixedAge", "FixedTitle", "Siblings", "CabinAssignment", "Embarked", "NumChildren" );
best_df = newTrain[, c("Survived", best_features)];
best.glm.fit = glm( Survived~., data=best_df, family=binomial);
trainSetProbs = predict( best.glm.fit, newTrain, type="response");  # now a vector of probabilities
# probabilities >= 0.5 mean survived, < 0.5 mean perished
threshold = 0.5;
predictions = as.numeric( trainSetProbs >= threshold );
summary( best.glm.fit );
## 
## Call:
## glm(formula = Survived ~ ., family = binomial, data = best_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3843  -0.5688  -0.3754   0.5284   2.6770  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                7.956973   0.916660   8.680  < 2e-16 ***
## pclass2                   -0.579907   0.398655  -1.455  0.14576    
## pclass3                   -1.594762   0.391739  -4.071 4.68e-05 ***
## Sexmale                   -3.547781   0.639796  -5.545 2.94e-08 ***
## FixedAge                  -0.026580   0.009689  -2.743  0.00608 ** 
## FixedTitleMiss.           -3.980327   0.840060  -4.738 2.16e-06 ***
## FixedTitleMr.             -3.300974   0.573629  -5.755 8.69e-09 ***
## FixedTitleMrs.            -3.235773   0.813646  -3.977 6.98e-05 ***
## Siblings                  -0.655098   0.134485  -4.871 1.11e-06 ***
## CabinAssignmentUnassigned -0.787387   0.353103  -2.230  0.02575 *  
## EmbarkedQ                 -0.113587   0.392621  -0.289  0.77235    
## EmbarkedS                 -0.485940   0.246626  -1.970  0.04880 *  
## NumChildren               -0.397312   0.159939  -2.484  0.01299 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  719.17  on 878  degrees of freedom
## AIC: 745.17
## 
## Number of Fisher Scoring iterations: 5
# How'd we do?
# now calculate the training set error
correctRate = mean( predictions == newTrain$Survived );
# scored a 0.8395062

# now, how'd we do against the test set?
testSetProbs = predict( best.glm.fit, newTest, type="response");
bestTestPredictions = as.numeric( testSetProbs >= threshold );
bestLogisticRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=predictions);
bestLogisticRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=bestTestPredictions);

bestLogisticRegressionModelTrainProbs = data.frame(PassengerId=train$PassengerId, Survived=trainSetProbs);
bestLogisticRegressionModelProbs = data.frame(PassengerId=test$PassengerId, Survived=testSetProbs);

write.csv( bestLogisticRegressionModelTrainProbs, "LogisticRegressionSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bestLogisticRegressionModelProbs, "LogisticRegressionSigModelProbs.csv", row.names=FALSE, quote=FALSE );

write.csv( bestLogisticRegressionModelTrain, "LogisticRegressionSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( bestLogisticRegressionModel, "LogisticRegressionSigModel.csv", row.names=FALSE, quote=FALSE );
# scored a 0.77990 - not as good as the full model

The model based on significant features scored lower on the test set than the full model. To reduce the model fit to noise in the training data, the regularization methods of ridge regression and lasso are considered.

Logistic Regression with Ridge Regression Regularization

Ridge regression introduces a coefficient penalty that reduces the effect of the covariates. This can help with variance by making the solution slightly less flexible. The glmnet package includes a method for determining the lambda penalty coefficient using cross validation (default is 10-fold cross validation).

Logistic Ridge Regression on full feature set

library( glmnet, quietly = TRUE );
## Loaded glmnet 2.0-13
lambdas = c();
set.seed(1);
# use ridge regression
xTrain = model.matrix( Survived~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain,family=binomial);
xTest = model.matrix( ~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTest,family=binomial);
#xTrain = model.matrix( Survived~.,data=full_df,family=binomial);
full_df_test = newTest[, all_features];
#xTest = model.matrix( ~.,data=full_df_test,family=binomial);
cv.out = cv.glmnet( xTrain, newTrain$Survived, alpha=0, type.measure="class");
plot( cv.out );

lambdaMin = cv.out$lambda.min;
rrmodel = glmnet( xTrain, newTrain$Survived, alpha=0);
train.ridge.pred = predict( rrmodel, s=lambdaMin, newx=xTrain);
trainRidgePredictions = as.numeric( train.ridge.pred >= threshold );
numCorrect = sum(trainRidgePredictions == newTrain$Survived);
accuracy = numCorrect/nrow(newTrain);
ridge.pred = predict( rrmodel, s=lambdaMin, newx=xTest);
fullRidgePredictions = as.numeric( ridge.pred >= threshold );
fullRidgeRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=trainRidgePredictions);
fullRidgeRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=fullRidgePredictions);
fullRidgeRegressionModelTrainProbs = data.frame(PassengerId=train$PassengerId, Survived=train.ridge.pred);
fullRidgeRegressionModelProbs = data.frame(PassengerId=test$PassengerId, Survived=ridge.pred);
write.csv( fullRidgeRegressionModelTrainProbs, "RidgeRegressionFullModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullRidgeRegressionModelProbs, "RidgeRegressionFullModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullRidgeRegressionModelTrain, "RidgeRegressionFullModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( fullRidgeRegressionModel, "RidgeRegressionFullModel.csv", row.names=FALSE, quote=FALSE );
# score is 0.78947

The training set mean squared error is the smallest at a lambda value of 0.02931267. Against the training data, the model achieved an accuracy of 0.8226712.

confusionMatrix(trainRidgePredictions, newTrain$Survived);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 482  91
##          1  67 251
##                                          
##                Accuracy : 0.8227         
##                  95% CI : (0.796, 0.8472)
##     No Information Rate : 0.6162         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.6201         
##  Mcnemar's Test P-Value : 0.06728        
##                                          
##             Sensitivity : 0.8780         
##             Specificity : 0.7339         
##          Pos Pred Value : 0.8412         
##          Neg Pred Value : 0.7893         
##              Prevalence : 0.6162         
##          Detection Rate : 0.5410         
##    Detection Prevalence : 0.6431         
##       Balanced Accuracy : 0.8059         
##                                          
##        'Positive' Class : 0              
## 

When comparing to the full logistic regression solution, the ridge predictions are the same as the least-squares logistic regression in 398 of the 418 cases (95.2% of the test set). The model scored a 0.78468 upon submission to the Kaggle site - slightly less than the full logistic regression solution (see comparison below).

fullLogisticPredictions = bestTestPredictions;
confusionMatrix( fullRidgePredictions, fullLogisticPredictions );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 245  14
##          1   6 153
##                                           
##                Accuracy : 0.9522          
##                  95% CI : (0.9271, 0.9705)
##     No Information Rate : 0.6005          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8995          
##  Mcnemar's Test P-Value : 0.1175          
##                                           
##             Sensitivity : 0.9761          
##             Specificity : 0.9162          
##          Pos Pred Value : 0.9459          
##          Neg Pred Value : 0.9623          
##              Prevalence : 0.6005          
##          Detection Rate : 0.5861          
##    Detection Prevalence : 0.6196          
##       Balanced Accuracy : 0.9461          
##                                           
##        'Positive' Class : 0               
## 

Next, a ridge regression model is fit using only the significant features determined from the logistic regression solution. From the training data, lambda was determined to be 0.02931267. Against the training data, the model achieved an accuracy of 0.8249158.

Logistic Ridge Regression on significant features

library( glmnet );
lambdas = c();
set.seed(1);
# use ridge regression
xTrain = model.matrix( Survived~.,data=best_df,family=binomial);
best_df_test = newTest[, best_features];
xTest = model.matrix( ~.,data=best_df_test,family=binomial);
cv.out = cv.glmnet( xTrain, newTrain$Survived, alpha=0, type.measure="class");
plot( cv.out );

lambdaMin = cv.out$lambda.min;
rrmodel = glmnet( xTrain, newTrain$Survived, alpha=0);
train.ridge.pred = predict( rrmodel, s=lambdaMin, newx=xTrain);
trainRidgePredictions = as.numeric( train.ridge.pred >= threshold );
numCorrect = sum(trainRidgePredictions == newTrain$Survived);
accuracy = numCorrect/nrow(newTrain);
ridge.pred = predict( rrmodel, s=lambdaMin, newx=xTest);
bestRidgePredictions = as.numeric( ridge.pred >= threshold );
bestRidgeRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=trainRidgePredictions);
bestRidgeRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=bestRidgePredictions);

bestRidgeRegressionModelTrainProbs = data.frame(PassengerId=train$PassengerId, Survived=as.numeric(train.ridge.pred));
bestRidgeRegressionModelProbs = data.frame(PassengerId=test$PassengerId, Survived=as.numeric(ridge.pred));
write.csv( bestRidgeRegressionModelTrainProbs, "RidgeRegressionSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bestRidgeRegressionModelProbs, "RidgeRegressionSigModelProbs.csv", row.names=FALSE, quote=FALSE );

write.csv( bestRidgeRegressionModelTrain, "RidgeRegressionSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( bestRidgeRegressionModel, "RidgeRegressionSigModel.csv", row.names=FALSE, quote=FALSE );

When comparing to the full logistic regression solution, the best feature ridge predictions are the same as the least-squares logistic regression in 399 of the 418 cases (95.4% of the test set). The model scored a 0.77990 upon submission to the Kaggle site - which is less than both the full logistic regression and full ridge regression models.

fullLogisticPredictions = bestTestPredictions;
confusionMatrix( bestRidgePredictions, fullLogisticPredictions );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 244  14
##          1   7 153
##                                           
##                Accuracy : 0.9498          
##                  95% CI : (0.9242, 0.9686)
##     No Information Rate : 0.6005          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8946          
##  Mcnemar's Test P-Value : 0.1904          
##                                           
##             Sensitivity : 0.9721          
##             Specificity : 0.9162          
##          Pos Pred Value : 0.9457          
##          Neg Pred Value : 0.9563          
##              Prevalence : 0.6005          
##          Detection Rate : 0.5837          
##    Detection Prevalence : 0.6172          
##       Balanced Accuracy : 0.9441          
##                                           
##        'Positive' Class : 0               
## 

Logistic Regression with Lasso Regularization

Similar to ridge regression, lasso regularization introduces a penalty to reduce the magnitude of coefficient estimates. As with ridge regression, the glmnet implementation of the lasso uses cross validation to estimate the best value of lambda. The Lasso regularization method multiplies lambda with the absolute value of the coefficients as the penalty and, unlike Ridge Regression, can be used to eliminate features altogether.

Lasso using all features

# on to the lasso
xTrain = model.matrix( Survived~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain,family=binomial);
xTest = model.matrix( ~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTest,family=binomial);
cv.out = cv.glmnet( xTrain, newTrain$Survived, alpha=1, type.measure = "class");
plot( cv.out );

lambdaMin = cv.out$lambda.min;
lassomodel = glmnet( xTrain, newTrain$Survived, alpha=1);
train.lasso.pred = predict( lassomodel, s=lambdaMin, newx=xTrain);
trainLassoPredictions = as.numeric( train.lasso.pred >= threshold );
numCorrect = sum(trainLassoPredictions == newTrain$Survived);
accuracy = numCorrect/nrow(newTrain);
lasso.pred = predict( lassomodel, s=lambdaMin, newx=xTest);
lassoPredictions = as.numeric( lasso.pred >= threshold );
fullLassoRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=trainLassoPredictions);
fullLassoRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=lassoPredictions);
fullLassoRegressionModelTrainProbs = data.frame(PassengerId=train$PassengerId, Survived=as.numeric(train.lasso.pred));
fullLassoRegressionModelProbs = data.frame(PassengerId=test$PassengerId, Survived=as.numeric(lasso.pred));
write.csv( fullLassoRegressionModelTrainProbs, "LassoRegressionFullModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullLassoRegressionModelProbs, "LassoRegressionFullModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullLassoRegressionModelTrain, "LassoRegressionFullModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( fullLassoRegressionModel, "LassoRegressionFullModel.csv", row.names=FALSE, quote=FALSE );

In the plot, as Lambda increases, the number of features in the model is reduced (see top axis). In this application, no features were eliminated. The Lasso model using all of the features had a training set error of 0.1638608 with a lambda of 0.00160114 (see confusion matrix below).

confusionMatrix( trainLassoPredictions, newTrain$Survived );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 489  86
##          1  60 256
##                                           
##                Accuracy : 0.8361          
##                  95% CI : (0.8102, 0.8599)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6485          
##  Mcnemar's Test P-Value : 0.03854         
##                                           
##             Sensitivity : 0.8907          
##             Specificity : 0.7485          
##          Pos Pred Value : 0.8504          
##          Neg Pred Value : 0.8101          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5488          
##    Detection Prevalence : 0.6453          
##       Balanced Accuracy : 0.8196          
##                                           
##        'Positive' Class : 0               
## 

When comparing to the full logistic regression model against the test data, the lasso model differed from the full logistic regression model in 11 cases, predicting the same outcome in 407 of 419 rows. Against the Kaggle test data, the model scored 0.78947.

confusionMatrix( lassoPredictions, fullLogisticPredictions );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 247   6
##          1   4 161
##                                           
##                Accuracy : 0.9761          
##                  95% CI : (0.9564, 0.9885)
##     No Information Rate : 0.6005          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.95            
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.9841          
##             Specificity : 0.9641          
##          Pos Pred Value : 0.9763          
##          Neg Pred Value : 0.9758          
##              Prevalence : 0.6005          
##          Detection Rate : 0.5909          
##    Detection Prevalence : 0.6053          
##       Balanced Accuracy : 0.9741          
##                                           
##        'Positive' Class : 0               
## 

Lasso on significant features

The lasso model with only the significant features achieved a training set error of 0.1627385 using a lambda of 0.0002067951.

# on to the lasso
xTrain = model.matrix( Survived~.,data=best_df,family=binomial);
best_df_test = newTest[, best_features];
xTest = model.matrix( ~.,data=best_df_test,family=binomial);
cv.out = cv.glmnet( xTrain, newTrain$Survived, alpha=1, type.measure = "class");
plot( cv.out );

lambdaMin = cv.out$lambda.min;
lassomodel = glmnet( xTrain, newTrain$Survived, alpha=1);
train.lasso.pred = predict( lassomodel, s=lambdaMin, newx=xTrain);
trainLassoPredictions = as.numeric( train.lasso.pred >= threshold );
numCorrect = sum(trainLassoPredictions == newTrain$Survived);
accuracy = numCorrect/nrow(newTrain);
lasso.pred = predict( lassomodel, s=lambdaMin, newx=xTest);
lassoPredictions = as.numeric( lasso.pred >= threshold );
bestLassoRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=trainLassoPredictions);
bestLassoRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=lassoPredictions);

bestLassoRegressionModelTrainProbs = data.frame(PassengerId=train$PassengerId, Survived=as.numeric(train.lasso.pred));
bestLassoRegressionModelProbs = data.frame(PassengerId=test$PassengerId, Survived=as.numeric(lasso.pred));

write.csv( bestLassoRegressionModelTrainProbs, "LassoRegressionSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bestLassoRegressionModelProbs, "LassoRegressionSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bestLassoRegressionModelTrain, "LassoRegressionSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( bestLassoRegressionModel, "LassoRegressionSigModel.csv", row.names=FALSE, quote=FALSE );

The “best” feature Lasso model had a training set error of 0.1616162 with a lambda of 0.0005754197 (see confusion matrix below).

confusionMatrix( trainLassoPredictions, newTrain$Survived );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 490  86
##          1  59 256
##                                           
##                Accuracy : 0.8373          
##                  95% CI : (0.8114, 0.8609)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6508          
##  Mcnemar's Test P-Value : 0.03084         
##                                           
##             Sensitivity : 0.8925          
##             Specificity : 0.7485          
##          Pos Pred Value : 0.8507          
##          Neg Pred Value : 0.8127          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5499          
##    Detection Prevalence : 0.6465          
##       Balanced Accuracy : 0.8205          
##                                           
##        'Positive' Class : 0               
## 

When comparing to the full logistic regression model against the test data, the “best feature” lasso model differed from the full logistic regression model in 11 cases, predicting the same outcome in 407 of 419 rows. Against the Kaggle test data, the model scored 0.79425.

confusionMatrix( lassoPredictions, fullLogisticPredictions );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 248   8
##          1   3 159
##                                           
##                Accuracy : 0.9737          
##                  95% CI : (0.9534, 0.9868)
##     No Information Rate : 0.6005          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9449          
##  Mcnemar's Test P-Value : 0.2278          
##                                           
##             Sensitivity : 0.9880          
##             Specificity : 0.9521          
##          Pos Pred Value : 0.9688          
##          Neg Pred Value : 0.9815          
##              Prevalence : 0.6005          
##          Detection Rate : 0.5933          
##    Detection Prevalence : 0.6124          
##       Balanced Accuracy : 0.9701          
##                                           
##        'Positive' Class : 0               
## 

The best feature lasso model differed from the full logistic regression model in 11 cases and from the best feature ridge regression model in 12 cases (see confusion matrices below).

confusionMatrix( lassoPredictions, fullLogisticPredictions );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 248   8
##          1   3 159
##                                           
##                Accuracy : 0.9737          
##                  95% CI : (0.9534, 0.9868)
##     No Information Rate : 0.6005          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9449          
##  Mcnemar's Test P-Value : 0.2278          
##                                           
##             Sensitivity : 0.9880          
##             Specificity : 0.9521          
##          Pos Pred Value : 0.9688          
##          Neg Pred Value : 0.9815          
##              Prevalence : 0.6005          
##          Detection Rate : 0.5933          
##    Detection Prevalence : 0.6124          
##       Balanced Accuracy : 0.9701          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix( bestRidgePredictions, lassoPredictions );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 251   7
##          1   5 155
##                                           
##                Accuracy : 0.9713          
##                  95% CI : (0.9504, 0.9851)
##     No Information Rate : 0.6124          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9394          
##  Mcnemar's Test P-Value : 0.7728          
##                                           
##             Sensitivity : 0.9805          
##             Specificity : 0.9568          
##          Pos Pred Value : 0.9729          
##          Neg Pred Value : 0.9688          
##              Prevalence : 0.6124          
##          Detection Rate : 0.6005          
##    Detection Prevalence : 0.6172          
##       Balanced Accuracy : 0.9686          
##                                           
##        'Positive' Class : 0               
## 
# score is 0.78947

K-Nearest Neighbors Models

K-Nearest Neighbors is a conceptionally simple method of assigning the most common class label to a point based on the labels of the closest K points. The lower the K, the higher the variance (at an extreme, K=1 will assign the class label based on the closest labeled point). Conversely, the higher the K, the lower the variance and higher the bias. With K as N-1 as an extreme, each point will be labeled as the most common class. This method lends itself nicely to cross-validation to find the best value of K.

KNN using all features

Using Leave One Out Cross Validation, the KNN model was trained using all features. The model was cross-validated using 20 different K-values from 5 to 43, odd numbered. The model achieved a training set accuracy of 0.8563 with a K of 9.

library(caret);
trainCtrl = trainControl( method="LOOCV");
knn_fit = train( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_fit
## k-Nearest Neighbors 
## 
## 891 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (16), scaled (16) 
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 890, 890, 890, 890, 890, 890, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.8193042  0.6152323
##    7  0.8204265  0.6157019
##    9  0.8249158  0.6248907
##   11  0.8136925  0.5990534
##   13  0.8125701  0.5950474
##   15  0.8114478  0.5937668
##   17  0.8080808  0.5844114
##   19  0.8002245  0.5661743
##   21  0.8148148  0.5971739
##   23  0.8204265  0.6087150
##   25  0.8215488  0.6104966
##   27  0.8215488  0.6104966
##   29  0.8114478  0.5896171
##   31  0.8092031  0.5837867
##   33  0.8125701  0.5899655
##   35  0.8114478  0.5877454
##   37  0.8103255  0.5850549
##   39  0.8136925  0.5917222
##   41  0.8103255  0.5836299
##   43  0.8069585  0.5769651
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 9.
plot( knn_fit );

test_pred = predict( knn_fit, newdata=newTrain);
confusionMatrix( test_pred, newTrain$SurvivedFactor)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 495  74
##          1  54 268
##                                           
##                Accuracy : 0.8563          
##                  95% CI : (0.8316, 0.8787)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6929          
##  Mcnemar's Test P-Value : 0.09308         
##                                           
##             Sensitivity : 0.9016          
##             Specificity : 0.7836          
##          Pos Pred Value : 0.8699          
##          Neg Pred Value : 0.8323          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5556          
##    Detection Prevalence : 0.6386          
##       Balanced Accuracy : 0.8426          
##                                           
##        'Positive' Class : 0               
## 

Against the Kaggle test data, the model scored 0.76555.

# now the whole training set
knnTrain = newTrain;
knnTest = newTest;
trainCtrl = trainControl( method="LOOCV" );
knn_fit = train( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=knnTrain,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_train_pred = predict( knn_fit, newdata=knnTrain);
knn_pred = predict( knn_fit, newdata=knnTest);
fullKNNModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(knn_train_pred)-1);
fullKNNModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(knn_pred)-1);
write.csv( fullKNNModelTrain, "KNNFullModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullKNNModel, "KNNFullModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( fullKNNModelTrain, "KNNFullModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( fullKNNModel, "KNNFullModel.csv", row.names=FALSE, quote=FALSE );

KNN with subsets of significant features

Since some features may not be informative and all features are used in finding closest points, it may be better to consider removing features that just add noise. The idea is to start with all significant features and then consider combinations as subsets for analysis.

significant_features = c("FixedTitle", "Sex", "Siblings", "FixedAge", "CabinAssignment", "pclass", "NumChildren", "FarePerPassenger", "Embarked");
all_combinations_of_significant_features = lapply(1:length(significant_features), function(x) combn(length(significant_features),x));
numFeatureSetClasses = length( all_combinations_of_significant_features );
correct = list();
accuracies = list();
errors = list();
features = list();
numFeatures = list();
k = list();
counter = 1;
numTrain = nrow( newTrain );
trainCtrl = trainControl( method="repeatedcv", number = 8, repeats = 1 );
for ( featureClass in 1:numFeatureSetClasses){
  featClassColumns = dim( all_combinations_of_significant_features[[featureClass]] )[2];
  for ( thisCol in 1:featClassColumns ){
    feature_set = significant_features[ all_combinations_of_significant_features[[featureClass]][,thisCol]];
    nFeatures = length( feature_set );
    df = newTrain[, c("SurvivedFactor", feature_set)]
    knn_fit = train( SurvivedFactor~.,data=df,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
    knn_pred = predict( knn_fit, newdata=newTrain);
    numRight = sum( knn_pred == newTrain$SurvivedFactor );
    cv.error = ( numTrain - numRight )/numTrain;
    accuracy = mean( knn_pred == newTrain$SurvivedFactor );
    features[counter] = list( feature_set );
    numFeatures[counter] = nFeatures;
    accuracies[counter] = accuracy;
    correct[counter] = numRight;
    errors[counter] = cv.error;
    k[counter] = knn_fit$bestTune$k;
    counter = counter + 1;
  }
}
cvResults = data.frame( NumFeatures=unlist(numFeatures), CVError=unlist(errors), TrainingScore=unlist(accuracies), NumCorrect=unlist(correct), K=unlist(k), Features=I(features) );
cvResults = cvResults[order(cvResults$CVError),];
knitr::kable( head(cvResults,n=20L), format="markdown", longtable=TRUE);
NumFeatures CVError TrainingScore NumCorrect K Features
504 8 0.1189675 0.8810325 785 5 FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger, Embarked
387 6 0.1234568 0.8765432 781 5 FixedTitle, Sex, Siblings, FixedAge, pclass, FarePerPassenger
467 7 0.1234568 0.8765432 781 5 FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger
472 7 0.1245791 0.8754209 780 5 FixedTitle, Sex, Siblings, FixedAge, pclass, NumChildren, FarePerPassenger
502 8 0.1268238 0.8731762 778 5 FixedTitle, Sex, Siblings, FixedAge, CabinAssignment, pclass, NumChildren, FarePerPassenger
506 8 0.1268238 0.8731762 778 5 FixedTitle, Sex, Siblings, FixedAge, pclass, NumChildren, FarePerPassenger, Embarked
444 6 0.1290685 0.8709315 776 5 Sex, Siblings, FixedAge, pclass, NumChildren, FarePerPassenger
507 8 0.1290685 0.8709315 776 5 FixedTitle, Sex, Siblings, CabinAssignment, pclass, NumChildren, FarePerPassenger, Embarked
331 5 0.1301908 0.8698092 775 5 Sex, Siblings, FixedAge, pclass, FarePerPassenger
403 6 0.1301908 0.8698092 775 5 FixedTitle, Sex, FixedAge, CabinAssignment, pclass, FarePerPassenger
481 7 0.1301908 0.8698092 775 5 FixedTitle, Sex, FixedAge, CabinAssignment, pclass, NumChildren, FarePerPassenger
496 7 0.1301908 0.8698092 775 5 Sex, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger, Embarked
418 6 0.1313131 0.8686869 774 5 FixedTitle, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger
483 7 0.1313131 0.8686869 774 5 FixedTitle, Sex, FixedAge, CabinAssignment, pclass, FarePerPassenger, Embarked
389 6 0.1324355 0.8675645 773 5 FixedTitle, Sex, Siblings, FixedAge, NumChildren, FarePerPassenger
391 6 0.1335578 0.8664422 772 5 FixedTitle, Sex, Siblings, FixedAge, FarePerPassenger, Embarked
410 6 0.1335578 0.8664422 772 5 FixedTitle, Sex, FixedAge, pclass, FarePerPassenger, Embarked
420 6 0.1335578 0.8664422 772 5 FixedTitle, Siblings, FixedAge, CabinAssignment, NumChildren, FarePerPassenger
439 6 0.1335578 0.8664422 772 5 Sex, Siblings, FixedAge, CabinAssignment, pclass, FarePerPassenger
441 6 0.1335578 0.8664422 772 5 Sex, Siblings, FixedAge, CabinAssignment, NumChildren, FarePerPassenger

Now, take the top 10 results and repeat the analysis using LOOCV. LOOCV provides a better estimate of the error but is computationally costly.

correct = list();
accuracies = list();
errors = list();
features = list();
numFeatures = list();
k = list();
counter = 1;
numTrain = nrow( newTrain );
trainCtrl = trainControl( method="LOOCV" );
for ( rowNum in 1:10 ){
  feature_set = cvResults$Features[[rowNum]];
  nFeatures = cvResults$NumFeatures[rowNum];
  df = newTrain[, c("SurvivedFactor", feature_set)]
  knn_fit = train( SurvivedFactor~.,data=df,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
  knn_pred = predict( knn_fit, newdata=newTrain);
  numRight = sum( knn_pred == newTrain$SurvivedFactor );
  cv.error = ( numTrain - numRight )/numTrain;
  accuracy = mean( knn_pred == newTrain$SurvivedFactor );
  features[counter] = list( feature_set );
  numFeatures[counter] = nFeatures;
  accuracies[counter] = accuracy;
  correct[counter] = numRight;
  errors[counter] = cv.error;
  k[counter] = knn_fit$bestTune$k;
  counter = counter + 1;
  cat( counter, "\n")
}
loocvResults = data.frame( NumFeatures=unlist(numFeatures), CVError=unlist(errors), TrainingScore=unlist(accuracies), NumCorrect=unlist(correct), K=unlist(k), Features=I(features) );
loocvResults = loocvResults[order(loocvResults$CVError),];
knitr::kable( head(loocvResults,n=10L), format="markdown", longtable=TRUE);

Now, take the best 9, 8, 7, and 6 knn models.

KNN with with best subset of 9 significant features

With 9 features, the model achieves a training set accuracy of 0.8754 using a cross-validated K of 5. On the test set, the model scored 0.76076.

# best 9
knnTrain = newTrain;
knnTest = newTest;
trainCtrl = trainControl( method="LOOCV" );
knn_fit = train( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumChildren+Siblings,data=knnTrain,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_fit
## k-Nearest Neighbors 
## 
## 891 samples
##   9 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 890, 890, 890, 890, 890, 890, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.8282828  0.6314868
##    7  0.8249158  0.6232068
##    9  0.8002245  0.5681306
##   11  0.8159371  0.6029930
##   13  0.8193042  0.6109118
##   15  0.8013468  0.5736808
##   17  0.7979798  0.5647501
##   19  0.7946128  0.5592293
##   21  0.8013468  0.5722446
##   23  0.7991021  0.5669249
##   25  0.7946128  0.5557490
##   27  0.7923681  0.5493689
##   29  0.7923681  0.5478336
##   31  0.7912458  0.5430542
##   33  0.7957351  0.5518579
##   35  0.8024691  0.5666318
##   37  0.8013468  0.5639201
##   39  0.7979798  0.5562749
##   41  0.8013468  0.5639201
##   43  0.8002245  0.5622081
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 5.
plot( knn_fit )

# training set performance
knn_train_pred = predict( knn_fit, newdata=knnTrain);
confusionMatrix(knn_train_pred, knnTrain$SurvivedFactor);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 502  65
##          1  47 277
##                                           
##                Accuracy : 0.8743          
##                  95% CI : (0.8507, 0.8954)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7316          
##  Mcnemar's Test P-Value : 0.1082          
##                                           
##             Sensitivity : 0.9144          
##             Specificity : 0.8099          
##          Pos Pred Value : 0.8854          
##          Neg Pred Value : 0.8549          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5634          
##    Detection Prevalence : 0.6364          
##       Balanced Accuracy : 0.8622          
##                                           
##        'Positive' Class : 0               
## 
# test set performance
knn_pred = predict( knn_fit, newdata=knnTest);
best9KNNModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(knn_train_pred)-1);
best9KNNModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(knn_pred)-1);
write.csv( best9KNNModelTrain, "KNNBest9ModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best9KNNModel, "KNNBest9ModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best9KNNModelTrain, "KNNBest9ModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( best9KNNModel, "KNNBest9Model.csv", row.names=FALSE, quote=FALSE );

KNN with with best subset of 8 significant features

With 8 features, the model achieves a training set accuracy of 0.881 using a cross-validated K of 5. On the test set, the model scored 0.73684.

# best 8
trainCtrl = trainControl( method="LOOCV" );
knn_fit = train( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+Siblings,data=knnTrain,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_fit
## k-Nearest Neighbors 
## 
## 891 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (12), scaled (12) 
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 890, 890, 890, 890, 890, 890, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.8372615  0.6499720
##    7  0.8249158  0.6227834
##    9  0.8080808  0.5853479
##   11  0.8181818  0.6078345
##   13  0.8170595  0.6078398
##   15  0.8103255  0.5924934
##   17  0.8002245  0.5676432
##   19  0.7968575  0.5601080
##   21  0.8058361  0.5766826
##   23  0.8047138  0.5730220
##   25  0.8035915  0.5683534
##   27  0.7957351  0.5503145
##   29  0.7934905  0.5438017
##   31  0.7867565  0.5256468
##   33  0.7867565  0.5261965
##   35  0.7867565  0.5289256
##   37  0.7957351  0.5518579
##   39  0.8002245  0.5627078
##   41  0.7979798  0.5567826
##   43  0.8013468  0.5644187
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 5.
plot( knn_fit )

# training set performance
knn_train_pred = predict( knn_fit, newdata=knnTrain);
confusionMatrix(knn_train_pred, knnTrain$SurvivedFactor);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 506  65
##          1  43 277
##                                           
##                Accuracy : 0.8788          
##                  95% CI : (0.8555, 0.8995)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7406          
##  Mcnemar's Test P-Value : 0.04331         
##                                           
##             Sensitivity : 0.9217          
##             Specificity : 0.8099          
##          Pos Pred Value : 0.8862          
##          Neg Pred Value : 0.8656          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5679          
##    Detection Prevalence : 0.6409          
##       Balanced Accuracy : 0.8658          
##                                           
##        'Positive' Class : 0               
## 
# test set performance
knn_pred = predict( knn_fit, newdata=knnTest);
best8KNNModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(knn_train_pred)-1);
best8KNNModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(knn_pred)-1);
write.csv( best8KNNModelTrain, "KNNBest8ModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best8KNNModel, "KNNBest8ModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best8KNNModelTrain, "KNNBest8ModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( best8KNNModel, "KNNBest8Model.csv", row.names=FALSE, quote=FALSE );

KNN with with best subset of 7 significant features

With 7 features, the model achieves a training set accuracy of 0.8765 using a cross-validated K of 5. On the test set, the model scored 0.73684.

# best 7
trainCtrl = trainControl( method="LOOCV" );
knn_fit = train( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Siblings,data=knnTrain,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_fit
## k-Nearest Neighbors 
## 
## 891 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 890, 890, 890, 890, 890, 890, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.8327722  0.6431181
##    7  0.8148148  0.6034713
##    9  0.8193042  0.6104744
##   11  0.8170595  0.6069611
##   13  0.8260382  0.6287470
##   15  0.8226712  0.6234473
##   17  0.8058361  0.5860944
##   19  0.8103255  0.5943115
##   21  0.8002245  0.5724683
##   23  0.8013468  0.5760530
##   25  0.7957351  0.5633482
##   27  0.7991021  0.5712627
##   29  0.8013468  0.5760530
##   31  0.7968575  0.5650200
##   33  0.7890011  0.5464249
##   35  0.7968575  0.5625778
##   37  0.7968575  0.5625778
##   39  0.7676768  0.5052987
##   41  0.7789001  0.5323134
##   43  0.7766554  0.5254715
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 5.
plot( knn_fit )

# training set performance
knn_train_pred = predict( knn_fit, newdata=knnTrain);
confusionMatrix(knn_train_pred, knnTrain$SurvivedFactor);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 499  61
##          1  50 281
##                                           
##                Accuracy : 0.8754          
##                  95% CI : (0.8519, 0.8964)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.735           
##  Mcnemar's Test P-Value : 0.3425          
##                                           
##             Sensitivity : 0.9089          
##             Specificity : 0.8216          
##          Pos Pred Value : 0.8911          
##          Neg Pred Value : 0.8489          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5600          
##    Detection Prevalence : 0.6285          
##       Balanced Accuracy : 0.8653          
##                                           
##        'Positive' Class : 0               
## 
# test set performance
knn_pred = predict( knn_fit, newdata=knnTest);
best7KNNModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(knn_train_pred)-1);
best7KNNModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(knn_pred)-1);
write.csv( best7KNNModelTrain, "KNNBest7ModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best7KNNModel, "KNNBest7ModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best7KNNModelTrain, "KNNBest7ModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( best7KNNModel, "KNNBest7Model.csv", row.names=FALSE, quote=FALSE );

KNN with with best subset of 6 significant features

With 6 features, the model achieves a training set accuracy of 0.8765 using a cross-validated K of 5. On the test set, the model scored 0.71770.

# best 6
trainCtrl = trainControl( method="LOOCV" );
knn_fit = train( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+Siblings,data=knnTrain,method="knn", trControl=trainCtrl, preProcess=c("center","scale"), tuneLength=20 );
knn_fit
## k-Nearest Neighbors 
## 
## 891 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (9), scaled (9) 
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 890, 890, 890, 890, 890, 890, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.8439955  0.6674400
##    7  0.8204265  0.6182581
##    9  0.8204265  0.6144109
##   11  0.8193042  0.6126516
##   13  0.8271605  0.6321656
##   15  0.8271605  0.6309368
##   17  0.8136925  0.6039427
##   19  0.7968575  0.5650200
##   21  0.7923681  0.5563979
##   23  0.7845118  0.5383270
##   25  0.7845118  0.5372931
##   27  0.7811448  0.5308513
##   29  0.7833895  0.5356631
##   31  0.7878788  0.5468065
##   33  0.7957351  0.5643196
##   35  0.7957351  0.5633482
##   37  0.7901235  0.5485845
##   39  0.7901235  0.5485845
##   41  0.7744108  0.5169559
##   43  0.7732884  0.5196483
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 5.
plot( knn_fit )

# training set performance
knn_train_pred = predict( knn_fit, newdata=knnTrain);
confusionMatrix(knn_train_pred, knnTrain$SurvivedFactor);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 497  59
##          1  52 283
##                                           
##                Accuracy : 0.8754          
##                  95% CI : (0.8519, 0.8964)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7356          
##  Mcnemar's Test P-Value : 0.569           
##                                           
##             Sensitivity : 0.9053          
##             Specificity : 0.8275          
##          Pos Pred Value : 0.8939          
##          Neg Pred Value : 0.8448          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5578          
##    Detection Prevalence : 0.6240          
##       Balanced Accuracy : 0.8664          
##                                           
##        'Positive' Class : 0               
## 
# test set performance
knn_pred = predict( knn_fit, newdata=knnTest);
best6KNNModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(knn_train_pred)-1);
best6KNNModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(knn_pred)-1);
write.csv( best6KNNModelTrain, "KNNBest6ModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best6KNNModel, "KNNBest6ModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( best6KNNModelTrain, "KNNBest6ModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( best6KNNModel, "KNNBest6Model.csv", row.names=FALSE, quote=FALSE );

The test set score of 0.71770 is less than that of the simple Gender Submission Model (0.76555).

Decision Tree Models

Classification Tree

A classification tree is used to model the data. Since the features contain nominal data (Embarked, Sex, for example), a decision tree is appropriate. After the tree is fit, terminal nodes are pruned using cross validation.
The tree model scored 0.8507 against the training data and 0.7727 against the test data.

library(tree);
tree_model = tree( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain );
summary( tree_model )
## 
## Classification tree:
## tree(formula = SurvivedFactor ~ pclass + Sex + FixedAge + FarePerPassenger + 
##     FixedTitle + CabinAssignment + Embarked + NumPassengersOnTicket + 
##     NumParents + NumChildren + SpousesFactor + Siblings, data = newTrain)
## Variables actually used in tree construction:
## [1] "FixedTitle"            "FarePerPassenger"      "pclass"               
## [4] "NumPassengersOnTicket" "Sex"                  
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7585 = 669 / 882 
## Misclassification error rate: 0.1493 = 133 / 891
cv_tree_model = cv.tree( tree_model, FUN=prune.misclass);
best_terminal_node_idx = which.min(cv_tree_model$dev);
best_terminal_nodes = cv_tree_model$size[best_terminal_node_idx];
pruned_tree = prune.misclass(tree_model,best=best_terminal_nodes);
plot( pruned_tree );
text( pruned_tree, pretty=0);

# training set performance against pruned tree
pruned_tree_train_pred = predict( pruned_tree, newTrain, type="class");
confusionMatrix(pruned_tree_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 507  91
##          1  42 251
##                                           
##                Accuracy : 0.8507          
##                  95% CI : (0.8256, 0.8735)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6757          
##  Mcnemar's Test P-Value : 3.153e-05       
##                                           
##             Sensitivity : 0.9235          
##             Specificity : 0.7339          
##          Pos Pred Value : 0.8478          
##          Neg Pred Value : 0.8567          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5690          
##    Detection Prevalence : 0.6712          
##       Balanced Accuracy : 0.8287          
##                                           
##        'Positive' Class : 0               
## 
# test set performance
tree_pred = predict( pruned_tree, newTest, type="class");
treeModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(pruned_tree_train_pred)-1);
treeModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(tree_pred)-1);
write.csv( treeModelTrain, "TreeModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( treeModel, "TreeModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( treeModelTrain, "TreeModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( treeModel, "TreeModel.csv", row.names=FALSE, quote=FALSE );

Tree model with significant features

Next, a tree model using only significant features is considered. Again, cross-validation is used to prune the tree. The model achieved an accuracy of 0.8541 on the training data and scored 0.7608 against the test data.

library(tree);
tree_model = tree( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumChildren+Siblings,data=newTrain );
summary( tree_model )
## 
## Classification tree:
## tree(formula = SurvivedFactor ~ pclass + Sex + FixedAge + FarePerPassenger + 
##     FixedTitle + CabinAssignment + Embarked + NumChildren + Siblings, 
##     data = newTrain)
## Variables actually used in tree construction:
## [1] "FixedTitle"       "FarePerPassenger" "pclass"          
## [4] "Siblings"         "FixedAge"         "Sex"             
## Number of terminal nodes:  10 
## Residual mean deviance:  0.7536 = 663.9 / 881 
## Misclassification error rate: 0.1459 = 130 / 891
cv_tree_model = cv.tree( tree_model, FUN=prune.misclass);
best_terminal_node_idx = which.min(cv_tree_model$dev);
best_terminal_nodes = cv_tree_model$size[best_terminal_node_idx];
pruned_tree = prune.misclass(tree_model,best=best_terminal_nodes);
plot( pruned_tree );
text( pruned_tree, pretty=0);

# training set performance against pruned tree
pruned_tree_train_pred = predict( pruned_tree, newTrain, type="class");
confusionMatrix(pruned_tree_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 510  91
##          1  39 251
##                                           
##                Accuracy : 0.8541          
##                  95% CI : (0.8292, 0.8766)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6824          
##  Mcnemar's Test P-Value : 7.713e-06       
##                                           
##             Sensitivity : 0.9290          
##             Specificity : 0.7339          
##          Pos Pred Value : 0.8486          
##          Neg Pred Value : 0.8655          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5724          
##    Detection Prevalence : 0.6745          
##       Balanced Accuracy : 0.8314          
##                                           
##        'Positive' Class : 0               
## 
# test set performance
tree_pred = predict( pruned_tree, newTest, type="class");
# scored 0.7608
treeModelTrain = data.frame(PassengerId=knnTrain$PassengerId, Survived=as.numeric(pruned_tree_train_pred)-1);
treeModel = data.frame(PassengerId=knnTest$PassengerId, Survived=as.numeric(tree_pred)-1);
write.csv( treeModelTrain, "TreeSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( treeModel, "TreeSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( treeModelTrain, "TreeSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( treeModel, "TreeSigModel.csv", row.names=FALSE, quote=FALSE );

Bootstrap Aggregation (Bagging) Tree Models

Bootstrap Aggregation is a method of resampling the data many times to generate a series of samples from the original training set. This allows for the fitting of many trees. In Bagging tree models, the results of these trees are accumulated and the class that a particular sample assumes most often is the winner (by majority vote). The averaging of many trees reduces variance.

Bagging model using all features

The Bagging model scored an impressive 0.9877 on the training set but only 0.7321 against the test set. The high train set accuracy and relatively low test set accuracy is indicative of over-training.

library(randomForest);
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
set.seed(1);
bag_model = randomForest( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain, mtry=12, importance=TRUE );
bag_model
## 
## Call:
##  randomForest(formula = SurvivedFactor ~ pclass + Sex + FixedAge +      FarePerPassenger + FixedTitle + CabinAssignment + Embarked +      NumPassengersOnTicket + NumParents + NumChildren + SpousesFactor +      Siblings, data = newTrain, mtry = 12, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 17.96%
## Confusion matrix:
##     0   1 class.error
## 0 479  70   0.1275046
## 1  90 252   0.2631579
plot( bag_model )

# Get training set performance
bag_train_pred = predict( bag_model, newdata=newTrain);
confusionMatrix( bag_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 546   8
##          1   3 334
##                                          
##                Accuracy : 0.9877         
##                  95% CI : (0.978, 0.9938)
##     No Information Rate : 0.6162         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9738         
##  Mcnemar's Test P-Value : 0.2278         
##                                          
##             Sensitivity : 0.9945         
##             Specificity : 0.9766         
##          Pos Pred Value : 0.9856         
##          Neg Pred Value : 0.9911         
##              Prevalence : 0.6162         
##          Detection Rate : 0.6128         
##    Detection Prevalence : 0.6218         
##       Balanced Accuracy : 0.9856         
##                                          
##        'Positive' Class : 0              
## 
# now test set
bag_pred = predict( bag_model, newdata=newTest);
bagModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(bag_train_pred)-1);
bagModel = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(bag_pred)-1);
write.csv( bagModelTrain, "BagModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bagModel, "BagModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bagModelTrain, "BagModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( bagModel, "BagModel.csv", row.names=FALSE, quote=FALSE );
# scored 0.73205

Plot shows the OOB error (black) as well as the class errors (0/Died = red, 1/Survived = green).

Bagging model using only significant features

Like the full featured model, the significant feature bagging model scored an impressive 0.9854 on the training set but only 0.7273 against the test set. The high train set accuracy and relatively low test set accuracy is indicative of over-training.

library(randomForest);
set.seed(1);
bag_model = randomForest( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumChildren+Siblings,data=newTrain, mtry=9, importance=TRUE );
bag_model
## 
## Call:
##  randomForest(formula = SurvivedFactor ~ pclass + Sex + FixedAge +      FarePerPassenger + FixedTitle + CabinAssignment + Embarked +      NumChildren + Siblings, data = newTrain, mtry = 9, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 18.07%
## Confusion matrix:
##     0   1 class.error
## 0 481  68   0.1238616
## 1  93 249   0.2719298
plot( bag_model )

# Get training set performance
bag_train_pred = predict( bag_model, newdata=newTrain);
confusionMatrix( bag_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 544   8
##          1   5 334
##                                           
##                Accuracy : 0.9854          
##                  95% CI : (0.9752, 0.9922)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9691          
##  Mcnemar's Test P-Value : 0.5791          
##                                           
##             Sensitivity : 0.9909          
##             Specificity : 0.9766          
##          Pos Pred Value : 0.9855          
##          Neg Pred Value : 0.9853          
##              Prevalence : 0.6162          
##          Detection Rate : 0.6105          
##    Detection Prevalence : 0.6195          
##       Balanced Accuracy : 0.9838          
##                                           
##        'Positive' Class : 0               
## 
# now test set
bag_pred = predict( bag_model, newdata=newTest);
bagModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(bag_train_pred)-1);
bagModel = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(bag_pred)-1);
write.csv( bagModelTrain, "BagSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bagModel, "BagSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( bagModelTrain, "BagSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( bagModel, "BagSigModel.csv", row.names=FALSE, quote=FALSE );
# scored 0.7273

Plot shows the OOB error (black) as well as the class errors (0/Died = red, 1/Survived = green).

Random Forest Models

A random forest is much like the bagging idea but has the constraint that each branch point is allowed a random selection of m features to use. This random selection of predictors decorrelates the trees from one another and creates a much more diverse set of trees than does bagging. As a rule of thumb, the number of predictors to choose from is set to the square root of the number of available predictors (for classification trees).

Random Forest Model using all features

Using the standard sqrt(predictors) = 3 for mtry resulted in another overfit model. But reducing the mtry to 1 (the number of predictors considered at each branch point), the random forest model scored 0.8575 against the training data and 0.7847 against the test data.

library(randomForest);
set.seed(1);
rf_model = randomForest( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain, mtry=1, importance=TRUE );
rf_model
## 
## Call:
##  randomForest(formula = SurvivedFactor ~ pclass + Sex + FixedAge +      FarePerPassenger + FixedTitle + CabinAssignment + Embarked +      NumPassengersOnTicket + NumParents + NumChildren + SpousesFactor +      Siblings, data = newTrain, mtry = 1, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 17.62%
## Confusion matrix:
##     0   1 class.error
## 0 506  43  0.07832423
## 1 114 228  0.33333333
importance( rf_model );
##                               0          1 MeanDecreaseAccuracy
## pclass                 7.731820 13.2196507            14.943597
## Sex                   15.218208 21.1711598            19.614457
## FixedAge               9.195339  9.0631091            13.216547
## FarePerPassenger       8.601746 11.9774619            15.137825
## FixedTitle            14.129828 18.4765974            18.439136
## CabinAssignment       10.028316  8.7620830            13.199128
## Embarked               4.265670  7.0988848             8.023670
## NumPassengersOnTicket 12.110813  8.7586309            15.309226
## NumParents            10.283945  5.0274583            12.075322
## NumChildren            5.132610  3.2312246             6.550804
## SpousesFactor          4.192165  0.6604533             4.439363
## Siblings              14.902568  2.2274841            14.908893
##                       MeanDecreaseGini
## pclass                       13.317381
## Sex                          35.371304
## FixedAge                     12.770873
## FarePerPassenger             18.733637
## FixedTitle                   36.625948
## CabinAssignment              10.341210
## Embarked                      5.687204
## NumPassengersOnTicket        12.360933
## NumParents                    3.886940
## NumChildren                   3.043912
## SpousesFactor                 2.148494
## Siblings                      6.499895
varImpPlot( rf_model );

plot( rf_model )

# training set performance
rf_train_pred = predict( rf_model, newdata=newTrain);
confusionMatrix( rf_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 524 102
##          1  25 240
##                                           
##                Accuracy : 0.8575          
##                  95% CI : (0.8328, 0.8798)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6853          
##  Mcnemar's Test P-Value : 1.542e-11       
##                                           
##             Sensitivity : 0.9545          
##             Specificity : 0.7018          
##          Pos Pred Value : 0.8371          
##          Neg Pred Value : 0.9057          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5881          
##    Detection Prevalence : 0.7026          
##       Balanced Accuracy : 0.8281          
##                                           
##        'Positive' Class : 0               
## 
# test set performance
rf_pred = predict( rf_model, newdata=newTest);
rfModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(rf_train_pred)-1);
rfModel = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(rf_pred)-1);
write.csv( rfModelTrain, "RandomForestModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModel, "RandomForestModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModelTrain, "RandomForestModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModel, "RandomForestModel.csv", row.names=FALSE, quote=FALSE );
# scored 0.7847

Random Forest Model using only significant features

Using the standard sqrt(predictors) = 3 for mtry resulted in another overfit model. But reducing the mtry to 1 (the number of predictors considered at each branch point), the random forest model scored 0.8608 against the training data and 0.7823 against the test data.

library(randomForest);
set.seed(1);
rf_model = randomForest( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumChildren+Siblings,data=newTrain, importance=TRUE, mtry = 1 );
rf_model
## 
## Call:
##  randomForest(formula = SurvivedFactor ~ pclass + Sex + FixedAge +      FarePerPassenger + FixedTitle + CabinAssignment + Embarked +      NumChildren + Siblings, data = newTrain, importance = TRUE,      mtry = 1) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 17.51%
## Confusion matrix:
##     0   1 class.error
## 0 502  47   0.0856102
## 1 109 233   0.3187135
importance( rf_model );
##                          0         1 MeanDecreaseAccuracy MeanDecreaseGini
## pclass            9.333196 14.819291            17.015375        16.386503
## Sex              16.023715 24.064663            22.274436        43.866722
## FixedAge         12.755309 10.706769            16.336319        15.238069
## FarePerPassenger  8.468047 11.854202            16.431479        21.697107
## FixedTitle       15.963874 20.103090            20.571212        48.303153
## CabinAssignment  10.567478  7.894701            13.556124        12.139451
## Embarked          4.083765  9.727112             9.901060         5.880477
## NumChildren       6.777913  2.930898             7.832526         3.771433
## Siblings         16.768459  3.949508            16.413947         7.036073
varImpPlot( rf_model );

plot( rf_model )

# training set performance
rf_train_pred = predict( rf_model, newdata=newTrain);
confusionMatrix( rf_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 516  91
##          1  33 251
##                                           
##                Accuracy : 0.8608          
##                  95% CI : (0.8363, 0.8829)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6961          
##  Mcnemar's Test P-Value : 3.076e-07       
##                                           
##             Sensitivity : 0.9399          
##             Specificity : 0.7339          
##          Pos Pred Value : 0.8501          
##          Neg Pred Value : 0.8838          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5791          
##    Detection Prevalence : 0.6813          
##       Balanced Accuracy : 0.8369          
##                                           
##        'Positive' Class : 0               
## 
# test set performance
rf_pred = predict( rf_model, newdata=newTest);
rfModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(rf_train_pred)-1);
rfModel = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(rf_pred)-1);
write.csv( rfModelTrain, "RandomForestSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModel, "RandomForestSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModelTrain, "RandomForestSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModel, "RandomForestSigModel.csv", row.names=FALSE, quote=FALSE );
# scored 0.7823

Random Forest using only 9 selected features

The Random forest model using all features generated a variable importance plot. The variable set that looked to be most informative is selected and used in this model. This model scored 0.8485 on the training set and 0.7775 on the test set.

library(randomForest);
set.seed(1);
rf_model = randomForest( SurvivedFactor~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+NumParents+Siblings+NumPassengersOnTicket,data=newTrain, importance=TRUE, mtry=1 );
rf_model
## 
## Call:
##  randomForest(formula = SurvivedFactor ~ pclass + Sex + FixedAge +      FarePerPassenger + FixedTitle + CabinAssignment + NumParents +      Siblings + NumPassengersOnTicket, data = newTrain, importance = TRUE,      mtry = 1) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 16.5%
## Confusion matrix:
##     0   1 class.error
## 0 493  56   0.1020036
## 1  91 251   0.2660819
importance( rf_model );
##                               0         1 MeanDecreaseAccuracy
## pclass                 8.319264 11.920284             15.35410
## Sex                   16.526483 23.762523             22.64204
## FixedAge              10.121820  9.609194             15.55474
## FarePerPassenger       8.057060 14.282136             17.34949
## FixedTitle            16.500111 22.101773             22.35903
## CabinAssignment        8.616449  6.996768             12.01374
## NumParents            10.059492  4.893490             11.73343
## Siblings              15.261830  4.924502             16.26732
## NumPassengersOnTicket 12.293240 10.961389             16.89836
##                       MeanDecreaseGini
## pclass                       15.668979
## Sex                          40.127918
## FixedAge                     14.342667
## FarePerPassenger             23.537830
## FixedTitle                   48.522611
## CabinAssignment              11.741865
## NumParents                    4.153769
## Siblings                      7.916815
## NumPassengersOnTicket        15.004035
varImpPlot( rf_model );

plot( rf_model )

# training set performance
rf_train_pred = predict( rf_model, newdata=newTrain);
confusionMatrix( rf_train_pred, newTrain$SurvivedFactor);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 499  85
##          1  50 257
##                                           
##                Accuracy : 0.8485          
##                  95% CI : (0.8232, 0.8714)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6734          
##  Mcnemar's Test P-Value : 0.003431        
##                                           
##             Sensitivity : 0.9089          
##             Specificity : 0.7515          
##          Pos Pred Value : 0.8545          
##          Neg Pred Value : 0.8371          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5600          
##    Detection Prevalence : 0.6554          
##       Balanced Accuracy : 0.8302          
##                                           
##        'Positive' Class : 0               
## 
# test set performance
rf_pred = predict( rf_model, newdata=newTest);
rfModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(rf_train_pred)-1);
rfModel = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(rf_pred)-1);
write.csv( rfModelTrain, "RandomForestTrimmedModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModel, "RandomForestTrimmedModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModelTrain, "RandomForestTrimmedModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( rfModel, "RandomForestTrimmedModel.csv", row.names=FALSE, quote=FALSE );
# scored 0.7775

Boosted Classification Tree Models

Gradient Boosting applied to classification trees involves fitting classification trees sequentially. That is, instead of generating a bunch of trees and averaging them as with bagging and random forests, boosting involves adding trees sequentially into a single model. Trees are successively fit to the residuals of the previous tree ensemble.

Gradient Boosting using all features

Since there are many parameters to consider for boosted classification trees, cross-validation is used to select the values which will generalize best. This analysis is done using all predictors.

# This is not run; ignore
library(gbm, quietly = TRUE);
tree_lengths = 1:4;
best_error = Inf;
best_idx = 0;
min_error_idx = tree_lengths;
min_errors = tree_lengths;
xTrain = newTrain[,c("Survived","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
xTest = newTest[,c("Survived","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
numSamples = 10;
trainingProportion = 0.8;
threshold = 0.5;
cvFolds = 8;  # n-fold validation
nTrees = 15000; 
treeDiv = 1000;
numIntTrees = nTrees/treeDiv - 1;
nCores = 7; # adjust according to computer

trainMat = matrix( nrow=numSamples, ncol=numIntTrees);
testMat = matrix( nrow=numSamples, ncol=numIntTrees);
for ( depth in tree_lengths ){
  set.seed(depth);
  sampleSet = createDataPartition(1:nrow(newTrain),numSamples,p=trainingProportion);
  for ( samp in 1:numSamples){
    trainSample = sampleSet[[samp]];
    testSample = setdiff(1:nrow(newTrain),trainSample);
    boosted_model = gbm( formula=Survived~.,data=xTrain[trainSample,], distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = depth, n.cores = nCores, verbose=FALSE );
    best_tree_num = gbm.perf( boosted_model );
    train_scores = 0;
    test_scores = 0;
    for ( ntree in 1:numIntTrees){
      treelen = (ntree-1)*treeDiv;
      # now that the model is trained, see how it does on training set for n trees.
      train_pred = predict( boosted_model, newdata=xTrain[trainSample,], n.trees=treelen, type="response");
      train_predictions = as.numeric( train_pred >= threshold );
      train_score = sum( train_predictions == newTrain$Survived[trainSample] )/length(trainSample);
      train_scores[ ntree ] = train_score;
      trainMat[samp,ntree] = train_score;
      cat( "trainMat[", samp, ",", ntree, "] = ", train_score, "\n");
      # now that the model is trained, see how it does on test set for n trees.
      test_pred = predict( boosted_model, newdata=xTrain[testSample,], n.trees=treelen, type="response");
      test_predictions = as.numeric( test_pred >= threshold );
      test_score = sum( test_predictions == newTrain$Survived[testSample] )/length(testSample);
      test_scores[ ntree ] = test_score;
      testMat[samp,ntree] = test_score;
    }
  }
  boosted_model = gbm( formula=Survived~.,data=xTrain, distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = depth, n.cores = nCores, verbose=FALSE );
  min_error_idx[depth] = which.min( boosted_model$cv.error );
  min_errors[depth] = min( boosted_model$cv.error );
  if ( min_errors[depth] < best_error ){
    best_error = min_errors[depth];
    best_idx = depth;
  }
}
gbtrees = data.frame( TreeDepth = tree_lengths, BernoulliDeviance = min_errors, NumTrees = min_error_idx );
knitr::kable( gbtrees, format="markdown", longtable=TRUE);
#summary( boosted_model );
#plot( boosted_model );
set.seed(best_idx);
best_model = gbm( formula=Survived~.,data=xTrain, distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = best_idx, n.cores = nCores, verbose=FALSE );
gbm.perf( best_model );
boosted_train_pred = predict( best_model, newdata=xTrain, n.trees=min_error_idx[best_idx], type="response");
boosted_pred = predict( best_model, newdata=xTest, n.trees=min_error_idx[best_idx], type="response");
# probabilities >= 0.5 mean survived, < 0.5 mean perished
boosted_train_predictions = as.numeric( boosted_train_pred >= threshold );
boosted_predictions = as.numeric( boosted_pred >= threshold );
boostedModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=boosted_train_predictions);
boostedModel = data.frame(PassengerId=newTest$PassengerId, Survived=boosted_predictions);
boostedModelTrainProbs = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(boosted_train_pred));
boostedModelProbs = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(boosted_pred));
write.csv( boostedModelTrainProbs, "BoostedTreeModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModelProbs, "BoostedTreeModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModelTrain, "BoostedTreeModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModel, "BoostedTreeModel.csv", row.names=FALSE, quote=FALSE );
# originally scored 0.76076 with interaction depth =4, 0.77990 with interaction depth = 1
# after 10-fold x-validation, considering interaction depths of 1 to 4, scored a 0.76555
# same model , stopping at 4800 trees yields 0.77511
library(gbm, quietly = TRUE);
## 
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
## 
##     aml
## The following object is masked from 'package:caret':
## 
##     cluster
## Loaded gbm 2.1.3
tree_lengths = 1:10;
numLengths = length( tree_lengths );
best_error = Inf;
best_idx = 0;
best_shrinkage = 0;
best_steps = 0;
xTrain = newTrain[,c("Survived","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
xTest = newTest[,c("Survived","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
learningRates = c(0.1,0.07,0.05,0.03,0.02,0.01);
numLR = length( learningRates );
threshold = 0.5;
cvFolds = 10;  # n-fold validation
nTrees = 3000; 
nCores = 7; # adjust according to computer
tree_depths = rep(0,numLengths*numLR);
steps = rep(0,numLengths*numLR);
deviances = rep(0,numLengths*numLR);
shrinkages = rep(0,numLengths*numLR);
idx = 1;

for ( depth in tree_lengths ){
  set.seed(depth);
  for ( lrIdx in 1:numLR){
      learningRate = learningRates[lrIdx];
      boosted_model = gbm( formula=Survived~.,data=xTrain, distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = depth, shrinkage = learningRate, n.cores = nCores, verbose=FALSE );
      idxCVErr = gbm.perf( boosted_model, plot.it=FALSE );
      cvError = boosted_model$cv.error[idxCVErr];
      if ( cvError < best_error ){
        best_error = cvError;
        best_idx = depth;
        best_shrinkage = learningRate;
        best_steps = idxCVErr;
      }
      tree_depths[idx] = depth;
      shrinkages[idx] = learningRate;
      steps[idx] = idxCVErr;
      deviances[idx] = cvError;
      idx = idx + 1;
  }
}
gbtrees = data.frame( TreeDepth = tree_depths, Shrinkage = shrinkages, NumTrees = steps, BernoulliDeviance = deviances );
knitr::kable( gbtrees, format="markdown", longtable=TRUE);
TreeDepth Shrinkage NumTrees BernoulliDeviance
1 0.10 122 0.8408372
1 0.07 372 0.8402510
1 0.05 761 0.8413770
1 0.03 2239 0.8308918
1 0.02 2813 0.8343378
1 0.01 2943 0.8302995
2 0.10 124 0.8040676
2 0.07 464 0.7921007
2 0.05 483 0.8071318
2 0.03 448 0.8084767
2 0.02 1324 0.7910884
2 0.01 958 0.8213471
3 0.10 109 0.8066694
3 0.07 181 0.8009394
3 0.05 333 0.7897554
3 0.03 417 0.7885938
3 0.02 519 0.8010698
3 0.01 1471 0.7991779
4 0.10 139 0.7834033
4 0.07 108 0.7969865
4 0.05 127 0.8078885
4 0.03 355 0.7872869
4 0.02 543 0.7930899
4 0.01 911 0.7926584
5 0.10 68 0.8016500
5 0.07 167 0.7922652
5 0.05 287 0.7891908
5 0.03 186 0.8009506
5 0.02 499 0.7874025
5 0.01 568 0.8113521
6 0.10 46 0.7915514
6 0.07 67 0.8015155
6 0.05 85 0.8021446
6 0.03 174 0.8002787
6 0.02 320 0.7954187
6 0.01 478 0.8045433
7 0.10 40 0.7902456
7 0.07 101 0.7838852
7 0.05 131 0.7943580
7 0.03 154 0.7881931
7 0.02 273 0.7987344
7 0.01 519 0.7876161
8 0.10 64 0.8074978
8 0.07 55 0.7850531
8 0.05 118 0.7819934
8 0.03 265 0.7907267
8 0.02 432 0.7838358
8 0.01 479 0.7964976
9 0.10 31 0.7919973
9 0.07 65 0.8089280
9 0.05 113 0.7894301
9 0.03 132 0.7841065
9 0.02 241 0.7879951
9 0.01 494 0.7839586
10 0.10 40 0.7854725
10 0.07 56 0.7856801
10 0.05 108 0.7876004
10 0.03 128 0.8020566
10 0.02 253 0.7824585
10 0.01 358 0.7931526
#summary( boosted_model );
#plot( boosted_model );
# create a line chart
xrange = range( shrinkages );
yrange = range( deviances );
plot( xrange, yrange, type="n", xlab="Shrinkage", ylab="CV Accuracy (Bernoulli Deviance)");
colors = rainbow( numLengths );
linetype = tree_lengths;
plotchar = seq(0,numLengths,1);
# add lines
for ( i in tree_lengths){
  thisDepthTrees = subset( gbtrees, gbtrees$TreeDepth == i );
  lines(thisDepthTrees$Shrinkage, thisDepthTrees$BernoulliDeviance, type="b", lwd=1.5, lty=linetype[i], col=colors[i], pch=plotchar[i]);
}
title( "Bernoulli Deviance by Shrinkage and Tree Depth");
legend( 0.08, yrange[2], 1:numLengths, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Tree Depth");

set.seed(best_idx);
best_model = gbm( formula=Survived~.,data=xTrain, distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = best_idx, shrinkage = best_shrinkage, n.cores = nCores, verbose=FALSE );
gbm.perf( best_model );

## [1] 92
boosted_train_pred = predict( best_model, newdata=xTrain, n.trees=best_steps, type="response");
boosted_train_calls = as.numeric( boosted_train_pred >= threshold );
confusionMatrix(boosted_train_calls,newTrain$Survived);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 513  59
##          1  36 283
##                                           
##                Accuracy : 0.8934          
##                  95% CI : (0.8712, 0.9129)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7717          
##  Mcnemar's Test P-Value : 0.024           
##                                           
##             Sensitivity : 0.9344          
##             Specificity : 0.8275          
##          Pos Pred Value : 0.8969          
##          Neg Pred Value : 0.8871          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5758          
##    Detection Prevalence : 0.6420          
##       Balanced Accuracy : 0.8810          
##                                           
##        'Positive' Class : 0               
## 
boosted_pred = predict( best_model, newdata=xTest, n.trees=best_steps, type="response");
# probabilities >= 0.5 mean survived, < 0.5 mean perished
boosted_predictions = as.numeric( boosted_pred >= threshold );
boostedModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=boosted_train_calls);
boostedModel = data.frame(PassengerId=newTest$PassengerId, Survived=boosted_predictions);
boostedModelTrainProbs = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(boosted_train_pred));
boostedModelProbs = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(boosted_pred));
write.csv( boostedModelTrainProbs, "BoostedTreeModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModelProbs, "BoostedTreeModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModelTrain, "BoostedTreeModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModel, "BoostedTreeModel.csv", row.names=FALSE, quote=FALSE );
# interaction.depth = 8, nTrees = 118, learningRate = 0.05
# scored 0.756

The Cross-validated parameters (interaction depth = 8, number of trees = 118, learning rate = 0.05) resulted in a training set accuracy of 0.8943 on the training set and 0.756 on the test set.

Gradient Boosting using only significant features

The same technique is used as above to find the best set of parameters for gradient boosting model for the significant features.

library(gbm);
tree_lengths = 1:10;
numLengths = length( tree_lengths );
best_error = Inf;
best_idx = 0;
best_shrinkage = 0;
best_steps = 0;
significant_features = c("FixedTitle", "Sex", "Siblings", "FixedAge", "CabinAssignment", "pclass", "NumChildren", "FarePerPassenger", "Embarked");
xTrain = newTrain[,c("Survived", significant_features) ];
xTest = newTest[,c("Survived", significant_features) ];
learningRates = c(0.1,0.07,0.05,0.03,0.02,0.01);
numLR = length( learningRates );
threshold = 0.5;
cvFolds = 10;  # n-fold validation
nTrees = 3000; 
nCores = 7; # adjust according to computer
tree_depths = rep(0,numLengths*numLR);
steps = rep(0,numLengths*numLR);
deviances = rep(0,numLengths*numLR);
shrinkages = rep(0,numLengths*numLR);
idx = 1;

for ( depth in tree_lengths ){
  set.seed(depth);
  for ( lrIdx in 1:numLR){
      learningRate = learningRates[lrIdx];
      boosted_model = gbm( formula=Survived~.,data=xTrain, distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = depth, shrinkage = learningRate, n.cores = nCores, verbose=FALSE );
      idxCVErr = gbm.perf( boosted_model, plot.it=FALSE );
      cvError = boosted_model$cv.error[idxCVErr];
      if ( cvError < best_error ){
        best_error = cvError;
        best_idx = depth;
        best_shrinkage = learningRate;
        best_steps = idxCVErr;
      }
      tree_depths[idx] = depth;
      shrinkages[idx] = learningRate;
      steps[idx] = idxCVErr;
      deviances[idx] = cvError;
      idx = idx + 1;
  }
}
gbtrees = data.frame( TreeDepth = tree_depths, Shrinkage = shrinkages, NumTrees = steps, BernoulliDeviance = deviances );
knitr::kable( gbtrees, format="markdown", longtable=TRUE);
TreeDepth Shrinkage NumTrees BernoulliDeviance
1 0.10 153 0.8408302
1 0.07 571 0.8382404
1 0.05 304 0.8458554
1 0.03 995 0.8296206
1 0.02 629 0.8401464
1 0.01 1897 0.8301282
2 0.10 126 0.8073095
2 0.07 422 0.7910109
2 0.05 334 0.8082076
2 0.03 449 0.8146400
2 0.02 1265 0.7942554
2 0.01 972 0.8232836
3 0.10 100 0.7984287
3 0.07 149 0.7976263
3 0.05 238 0.7966997
3 0.03 317 0.7924063
3 0.02 508 0.8024181
3 0.01 1474 0.8006754
4 0.10 138 0.7866196
4 0.07 94 0.8003936
4 0.05 144 0.8054965
4 0.03 350 0.7944821
4 0.02 760 0.7988662
4 0.01 911 0.8000948
5 0.10 112 0.8077974
5 0.07 168 0.7915250
5 0.05 182 0.7860657
5 0.03 252 0.8017584
5 0.02 346 0.7924162
5 0.01 559 0.8152680
6 0.10 55 0.7973603
6 0.07 97 0.8045107
6 0.05 85 0.8081989
6 0.03 186 0.7906270
6 0.02 282 0.7930570
6 0.01 426 0.8056687
7 0.10 40 0.7941372
7 0.07 86 0.7867753
7 0.05 113 0.7992422
7 0.03 169 0.7881527
7 0.02 267 0.7961951
7 0.01 522 0.7967638
8 0.10 51 0.8050424
8 0.07 97 0.7838809
8 0.05 88 0.7897468
8 0.03 148 0.7863680
8 0.02 328 0.7818409
8 0.01 473 0.7996210
9 0.10 31 0.7961626
9 0.07 76 0.8101975
9 0.05 112 0.7938873
9 0.03 184 0.7894002
9 0.02 202 0.7953271
9 0.01 484 0.7881808
10 0.10 41 0.7866709
10 0.07 50 0.7920365
10 0.05 108 0.7915468
10 0.03 118 0.8051032
10 0.02 296 0.7828338
10 0.01 359 0.7974175
#summary( boosted_model );
#plot( boosted_model );
# create a line chart
xrange = range( shrinkages );
yrange = range( deviances );
plot( xrange, yrange, type="n", xlab="Shrinkage", ylab="Cross Validation Error (Bernoulli Deviance)");
colors = rainbow( numLengths );
linetype = tree_lengths;
plotchar = seq(0,numLengths,1);
# add lines
for ( i in tree_lengths){
  thisDepthTrees = subset( gbtrees, gbtrees$TreeDepth == i );
  lines(thisDepthTrees$Shrinkage, thisDepthTrees$BernoulliDeviance, type="b", lwd=1.5, lty=linetype[i], col=colors[i], pch=plotchar[i]);
}
title( "Bernoulli Deviance by Shrinkage and Tree Depth");
legend( 0.08, yrange[2], 1:numLengths, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Tree Depth");

set.seed(best_idx);
best_model = gbm( formula=Survived~.,data=xTrain, distribution="bernoulli", n.trees=nTrees, cv.folds=cvFolds, interaction.depth = best_idx, shrinkage = best_shrinkage, n.cores = nCores, verbose=FALSE );
gbm.perf( best_model );

## [1] 306
boosted_train_pred = predict( best_model, newdata=xTrain, n.trees=best_steps, type="response");
boosted_train_calls = as.numeric( boosted_train_pred >= threshold );
confusionMatrix(boosted_train_calls,newTrain$Survived);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 517  64
##          1  32 278
##                                         
##                Accuracy : 0.8923        
##                  95% CI : (0.87, 0.9119)
##     No Information Rate : 0.6162        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.7681        
##  Mcnemar's Test P-Value : 0.001557      
##                                         
##             Sensitivity : 0.9417        
##             Specificity : 0.8129        
##          Pos Pred Value : 0.8898        
##          Neg Pred Value : 0.8968        
##              Prevalence : 0.6162        
##          Detection Rate : 0.5802        
##    Detection Prevalence : 0.6521        
##       Balanced Accuracy : 0.8773        
##                                         
##        'Positive' Class : 0             
## 
boosted_pred = predict( best_model, newdata=xTest, n.trees=best_steps, type="response");
# probabilities >= 0.5 mean survived, < 0.5 mean perished
boosted_predictions = as.numeric( boosted_pred >= threshold );
boostedModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=boosted_train_calls);
boostedModel = data.frame(PassengerId=newTest$PassengerId, Survived=boosted_predictions);
boostedModelTrainProbs = data.frame(PassengerId=newTrain$PassengerId, Survived=as.numeric(boosted_train_pred));
boostedModelProbs = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(boosted_pred));
write.csv( boostedModelTrainProbs, "BoostedTreeSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModelProbs, "BoostedTreeSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModelTrain, "BoostedTreeSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( boostedModel, "BoostedTreeSigModel.csv", row.names=FALSE, quote=FALSE );
# interaction.depth = 8, nTrees = 328, learningRate = 0.02
# scored 0.75119

The cross validated parameters were found to be an interaction depth of 8, 328 trees, and a learning rate of 0.02. The model achieved a score of 0.8923 on the training set and a sad 0.7368 on the test set.

Support Vector Machine (SVM) Model

Support Vector Machines seek to find the optimal separating hyperplane between two classes. There are several parameters to consider: cost (of a constraint violation), choice of kernel (and corresponding gamma). Cross-validation is used to estimate these parameters.

library(e1071);
library(caret);
#xTrain = newTrain[,c("pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
#yTrain = newTrain[,"SurvivedFactor"];
xTrain = model.matrix( ~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=whiteTrain);
yTrain = newTrain[,"SurvivedFactor"];
#xTest = newTest[,c("pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
xTest = model.matrix( ~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=whiteTest);
#xTrainSVM = model.matrix( ~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTrain);
#xTestSVM = model.matrix( ~pclass+Sex+FixedAge+FarePerPassenger+FixedTitle+CabinAssignment+Embarked+NumPassengersOnTicket+NumParents+NumChildren+SpousesFactor+Siblings,data=newTest);
costs = c(0.01, 0.1, 0.5, 0.9, 1, 1.1);
gammas = c(0.0001, 0.01, 0.05, 0.1, 0.5);
kernels = c("linear", "polynomial", "radial", "sigmoid");
bestCost = costs[1];
bestGamma = gammas[1];
bestKernel = kernels[1];
bestError = Inf;
errors = rep(Inf,1,length(kernels));
ctr = 1;
for ( kernel in kernels ){
  tune.out = tune( "svm", xTrain, train.y=yTrain, kernel=kernel, ranges=list( cost=costs,gamma=gammas) );
  tune.out$best.performance
  tune.out$best.model
  thisError = tune.out$best.performance;
  errors[ctr] = thisError;
  if ( thisError <  bestError ){
    bestError = thisError;
    bestKernel = kernel;
    bestGamma = tune.out$best.model$gamma;
    bestCost = tune.out$best.model$cost;
  }
  ctr = ctr + 1;
}
svm.fit = svm( x=xTrain, y=yTrain, kernel=bestKernel, cost=bestCost, gamma=bestGamma);
#plot( svm.fit, xTrain );
svm.train.pred = predict( svm.fit, xTrain );
svmTrainingPreds = as.numeric( svm.train.pred ) - 1;
truth = newTrain$Survived;
confusionMatrix( svmTrainingPreds, truth );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 490  84
##          1  59 258
##                                          
##                Accuracy : 0.8395         
##                  95% CI : (0.8137, 0.863)
##     No Information Rate : 0.6162         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.656          
##  Mcnemar's Test P-Value : 0.04475        
##                                          
##             Sensitivity : 0.8925         
##             Specificity : 0.7544         
##          Pos Pred Value : 0.8537         
##          Neg Pred Value : 0.8139         
##              Prevalence : 0.6162         
##          Detection Rate : 0.5499         
##    Detection Prevalence : 0.6442         
##       Balanced Accuracy : 0.8235         
##                                          
##        'Positive' Class : 0              
## 
# scores a 0.8294 on the training set
svm.test.pred = predict( svm.fit, xTest );  
svmPredictions = as.numeric( svm.test.pred ) - 1;
svmModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=svmTrainingPreds);
svmModel = data.frame(PassengerId=newTest$PassengerId, Survived=svmPredictions);

write.csv( svmModelTrain, "SVMModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( svmModel, "SVMModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( svmModelTrain, "SVMModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( svmModel, "SVMModel.csv", row.names=FALSE, quote=FALSE );
# scored a 0.7727

The SVM model scored a 0.8328 against the training set and a 0.7727 against the test set.

Neural Network Models

This neural network implementation uses a single layer. The number of hidden nodes and learning rate is selected through 10-fold cross validation.

Neural Network with one hidden layer utilizing all features

### NN Model
library(nnet);
#xTrain = newTrain[,c("SurvivedFactor", significant_features) ];
#xTest = newTest[,c("SurvivedFactor", significant_features) ];

xTrain = whiteTrain[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];
xTest = whiteTest[,c("SurvivedFactor","pclass","Sex","FixedAge","FarePerPassenger","FixedTitle","CabinAssignment","Embarked","NumPassengersOnTicket","NumParents","NumChildren","SpousesFactor","Siblings")];

nn = nnet( SurvivedFactor ~ ., data = xTrain, size=12, maxit=500, trace=FALSE);
# How do we do on the training data?

nn_train_pred_class = predict( nn, xTrain, type="class" );  # yields "0", "1"
nn_train_pred = as.numeric( nn_train_pred_class );   # transform to 0, 1
confusionMatrix(nn_train_pred,xTrain$Survived);

# try on test data
nn_test_pred_class = predict( nn, xTest, type="class" );  # yields "0", "1"
nn_test_pred = as.numeric( nn_test_pred_class );   # transform to 0, 1
# write to  file
nnModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=nn_train_pred);
nnModel = data.frame(PassengerId=newTest$PassengerId, Survived=nn_test_pred);
write.csv( nnModelTrain, "NeuralNetworkFullModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( nnModel, "NeuralNetworkFullModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( nnModelTrain, "NeuralNetworkFullModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( nnModel, "NeuralNetworkFullModel.csv", row.names=FALSE, quote=FALSE );
# best "over"trained model scores 0.73205 (306/418)

# Now, we wish to select parameters for neural network by using 10-fold cross-validation with caret
tuningGrid = expand.grid(size=1:12,decay=c(0,0.0001,0.05,0.1));
set.seed( 1 );
trControl = trainControl( method="repeatedcv", number=10, repeats=10 );
nn_cv = train( SurvivedFactor ~ ., data=xTrain, method="nnet", trControl = trControl, tuneGrid=tuningGrid, verbose=FALSE, trace=FALSE);

best_size = nn_cv$bestTune[1,"size"];
best_decay = nn_cv$bestTune[1,"decay"];
best_nn = nnet( SurvivedFactor ~ ., data = xTrain, size=best_size, decay=best_decay, maxit=500, trace=FALSE);
best_nn_train_pred_class = predict( best_nn, newdata=xTrain, type="class" );  # yields "0", "1"
best_nn_train_pred = as.numeric( best_nn_train_pred_class );   # transform to 0, 1
confusionMatrix( best_nn_train_pred, xTrain$Survived);
# now do on test data
best_nn_test_pred_class = predict( best_nn, newdata=xTest, type="class" );  # yields "0", "1"
best_nn_test_pred = as.numeric( best_nn_test_pred_class );   # transform to 0, 1
# write to  file
nnModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=best_nn_train_pred);
nnModel = data.frame(PassengerId=newTest$PassengerId, Survived=best_nn_test_pred);
write.csv( nnModelTrain, "NeuralNetworkFullModelCVTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( nnModel, "NeuralNetworkFullModelCVProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( nnModelTrain, "NeuralNetworkFullModelCVTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( nnModel, "NeuralNetworkFullModelCV.csv", row.names=FALSE, quote=FALSE );
# Neural network model scores 0.7847

The cross validated Neural Network model scored 0.8418 against the training set and 0.7847 against the test set.

Neural Network with one hidden layer utilizing significant features

### NN Model
library(nnet);
xTrain = whiteTrain[,c("SurvivedFactor", significant_features) ];
xTest = whiteTest[,c("SurvivedFactor", significant_features) ];

nn = nnet( SurvivedFactor ~ ., data = xTrain, size=length(significant_features), maxit=500, trace=FALSE);
# How do we do on the training data?

nn_train_pred_class = predict( nn, xTrain, type="class" );  # yields "0", "1"
nn_train_pred = as.numeric( nn_train_pred_class );   # transform to 0, 1
confusionMatrix(nn_train_pred,xTrain$Survived);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 531  60
##          1  18 282
##                                           
##                Accuracy : 0.9125          
##                  95% CI : (0.8919, 0.9302)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8105          
##  Mcnemar's Test P-Value : 3.445e-06       
##                                           
##             Sensitivity : 0.9672          
##             Specificity : 0.8246          
##          Pos Pred Value : 0.8985          
##          Neg Pred Value : 0.9400          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5960          
##    Detection Prevalence : 0.6633          
##       Balanced Accuracy : 0.8959          
##                                           
##        'Positive' Class : 0               
## 
# 0.9068 accuracy on the training set

# try on test data
nn_test_pred_class = predict( nn, xTest, type="class" );  # yields "0", "1"
nn_test_pred = as.numeric( nn_test_pred_class );   # transform to 0, 1
# write to  file
nnModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=nn_train_pred);
nnModel = data.frame(PassengerId=newTest$PassengerId, Survived=nn_test_pred);
write.csv( nnModelTrain, "NeuralNetworkSigModelTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( nnModel, "NeuralNetworkSigModelProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( nnModelTrain, "NeuralNetworkSigModelTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( nnModel, "NeuralNetworkSigModel.csv", row.names=FALSE, quote=FALSE );
# best "over"trained model scores 0.72727 (304/418)

# Now, we wish to select parameters for neural network by using 10-fold cross-validation with caret
tuningGrid = expand.grid(size=1:length(significant_features),decay=c(0,0.0001,0.05,0.1));
set.seed( 1 );
trControl = trainControl( method="repeatedcv", number=10, repeats=10 );
nn_cv = train( SurvivedFactor ~ ., data=xTrain, method="nnet", trControl = trControl, tuneGrid=tuningGrid, verbose=FALSE, trace=FALSE);

best_size = nn_cv$bestTune[1,"size"];
best_decay = nn_cv$bestTune[1,"decay"];
best_nn = nnet( SurvivedFactor ~ ., data = xTrain, size=best_size, decay=best_decay, maxit=500, trace=FALSE);
best_nn_train_pred_class = predict( best_nn, newdata=xTrain, type="class" );  # yields "0", "1"
best_nn_train_pred = as.numeric( best_nn_train_pred_class );   # transform to 0, 1
confusionMatrix( best_nn_train_pred, xTrain$Survived);
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 512  94
##          1  37 248
##                                          
##                Accuracy : 0.853          
##                  95% CI : (0.828, 0.8756)
##     No Information Rate : 0.6162         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6791         
##  Mcnemar's Test P-Value : 9.944e-07      
##                                          
##             Sensitivity : 0.9326         
##             Specificity : 0.7251         
##          Pos Pred Value : 0.8449         
##          Neg Pred Value : 0.8702         
##              Prevalence : 0.6162         
##          Detection Rate : 0.5746         
##    Detection Prevalence : 0.6801         
##       Balanced Accuracy : 0.8289         
##                                          
##        'Positive' Class : 0              
## 
# now do on test data
best_nn_test_pred_class = predict( best_nn, newdata=xTest, type="class" );  # yields "0", "1"
best_nn_test_pred = as.numeric( best_nn_test_pred_class );   # transform to 0, 1
# write to  file
nnModelTrain = data.frame(PassengerId=newTrain$PassengerId, Survived=best_nn_train_pred);
nnModel = data.frame(PassengerId=newTest$PassengerId, Survived=best_nn_test_pred);
write.csv( nnModelTrain, "NeuralNetworkSigModelCVTrainProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( nnModel, "NeuralNetworkSigModelCVProbs.csv", row.names=FALSE, quote=FALSE );
write.csv( nnModelTrain, "NeuralNetworkSigModelCVTrain.csv", row.names=FALSE, quote=FALSE );
write.csv( nnModel, "NeuralNetworkSigModelCV.csv", row.names=FALSE, quote=FALSE );
# Neural network model scores 0.7847 (334/418) - same as full featured model

The cross validated Neural Network model with only significant features scored 0.853 against the training set and 0.7751 against the test set.

Ensemble Model

Ensemble models use multiple models to make predictions. Ensemble models usually fall into one of three different types: bagging, boosting, and stacking. In a bagging model paradigm, a single model (usually) is run multiple times over different subsamples of the data. These models are usually not complex to avoid overfitting and hopefully learn one part of the data and so are uncorrelated with one another. The output of the models are then combined to make a prediction. The Gradient Boosting model that we used with classification trees is an example of an ensemble model where successive models are employed to reduce the residuals of the previous model. Finally, a stacking ensemble model is much like a traditional machine learning model only the features or inputs to the learning model are themselves outputs of different types of machine learning models. The caveat, however, is that in order for ensembling to work, the models must be uncorrelated.

The idea in this section is to gather the results from all the trained models. Where possible, I used the probabilties of individual predictions (i.e., 0.983 chance of survival) over class calls (binary 0=dead, 1=survived) as that data is a little more fine-tuned. For this dataset, many of the models that I have trained are drastically overfit. The gradient boosted tree models, with training accuracies of 98% are extremely overfit. Other models are just not well-suited for this type of problem (KNN, LDA, QDA). Once I pared the list of models to just include the models that I thought were well-suited for this problem, I computed all of the cross-correlations between models and then enumerated all of the sets of models that were below a specified cross-correlation (I picked 0.75 to signify a high correlation). Once I did this, I found that all of the remaining models were highly correlated to one another, so I coudn’t make an ensemble! But since I’m determined to make an ensemble, I pressed on.

I added back the LDA, QDA, and neural network models and reran the correlation analysis. I purposely excluded the KNN and Bagging models since I believe KNN to not be well-suited for this problem and the bagging models are massively overfit. The correlation analysis showed which models could be combined into a single stacked model (see below). I manually selected the set of 3 models: RandomForest, Gender model, and QDA, since they represent three models of different complexity (simple: gender model, balanced: Random Forest, complex: QDA), hoping the strengths of each would complement each other.

In addition, I decided to use all the models in a different ensemble model, just taking row means. This set included all of the models, including the overfit bagging models, KNN, boosting, etc. Many of these models are highly correlated so the ensemble method is just a simple mean across all of the models in each row.

library(matlab,quietly=TRUE);
## 
## Attaching package: 'matlab'
## The following object is masked from 'package:stats':
## 
##     reshape
## The following objects are masked from 'package:utils':
## 
##     find, fix
## The following object is masked from 'package:base':
## 
##     sum
library(ROCR,quietly=TRUE);
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(tidyr,quietly=TRUE);
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:Matrix':
## 
##     expand
library(dplyr,quietly=TRUE);
library(tibble,quietly=TRUE);
## 
## Attaching package: 'tibble'
## The following object is masked from 'package:assertthat':
## 
##     has_name
all_train_models = c( "BagModelTrainProbs.csv", "BagSigModelTrainProbs.csv", "BoostedTreeModelTrainProbs.csv", "BoostedTreeSigModelTrainProbs.csv", "GenderModelTrainProbs.csv", "KNNBest6ModelTrainProbs.csv", "KNNBest7ModelTrainProbs.csv", "KNNBest8ModelTrainProbs.csv", "KNNBest9ModelTrainProbs.csv", "KNNFullModelTrainProbs.csv", "LassoRegressionFullModelTrainProbs.csv", "LassoRegressionSigModelTrainProbs.csv", "LDAModelTrainProbs.csv", "LDAReducedModelTrainProbs.csv", "LogisticRegressionFullModelTrainProbs.csv", "LogisticRegressionSigModelTrainProbs.csv", "NeuralNetworkSigModelCVTrainProbs.csv", "NeuralNetworkSigModelTrainProbs.csv", "NullModelTrainProbs.csv", "QDAModelTrainProbs.csv", "QDASigModelTrainProbs.csv", "RandomForestModelTrainProbs.csv", "RandomForestSigModelTrainProbs.csv", "RandomForestTrimmedModelTrainProbs.csv", "RidgeRegressionFullModelTrainProbs.csv", "RidgeRegressionSigModelTrainProbs.csv", "SVMModelTrainProbs.csv", "TreeModelTrainProbs.csv", "TreeSigModelTrainProbs.csv", "WCModelTrainProbs.csv" );

train_models = c( "BoostedTreeModelTrainProbs.csv", "BoostedTreeSigModelTrainProbs.csv", "GenderModelTrainProbs.csv", "LassoRegressionFullModelTrainProbs.csv", "LassoRegressionSigModelTrainProbs.csv", "LDAModelTrainProbs.csv", "LDAReducedModelTrainProbs.csv", "LogisticRegressionFullModelTrainProbs.csv", "LogisticRegressionSigModelTrainProbs.csv", "NeuralNetworkSigModelCVTrainProbs.csv", "NeuralNetworkSigModelTrainProbs.csv", "QDAModelTrainProbs.csv", "QDASigModelTrainProbs.csv", "RandomForestModelTrainProbs.csv", "RandomForestSigModelTrainProbs.csv", "RandomForestTrimmedModelTrainProbs.csv", "RidgeRegressionFullModelTrainProbs.csv", "RidgeRegressionSigModelTrainProbs.csv", "SVMModelTrainProbs.csv", "TreeModelTrainProbs.csv", "TreeSigModelTrainProbs.csv", "WCModelTrainProbs.csv" );

all_test_models = c( "BagModelProbs.csv", "BagSigModelProbs.csv", "BoostedTreeModelProbs.csv", "BoostedTreeSigModelProbs.csv", "GenderModelProbs.csv", "KNNBest6ModelProbs.csv", "KNNBest7ModelProbs.csv", "KNNBest8ModelProbs.csv", "KNNBest9ModelProbs.csv", "KNNFullModelProbs.csv", "LassoRegressionFullModelProbs.csv", "LassoRegressionSigModelProbs.csv", "LDAModelProbs.csv", "LDAReducedModelProbs.csv", "LogisticRegressionFullModelProbs.csv", "LogisticRegressionSigModelProbs.csv", "NeuralNetworkSigModelCVProbs.csv", "NeuralNetworkSigModelProbs.csv", "NullModelProbs.csv", "QDAModelProbs.csv", "QDASigModelProbs.csv", "RandomForestModelProbs.csv", "RandomForestSigModelProbs.csv", "RandomForestTrimmedModelProbs.csv", "RidgeRegressionFullModelProbs.csv", "RidgeRegressionSigModelProbs.csv", "SVMModelProbs.csv", "TreeModelProbs.csv", "TreeSigModelProbs.csv", "WCModelProbs.csv" );

test_models = c( "BoostedTreeModelProbs.csv", "BoostedTreeSigModelProbs.csv", "GenderModelProbs.csv", "LassoRegressionFullModelProbs.csv", "LassoRegressionSigModelProbs.csv", "LDAModelProbs.csv", "LDAReducedModelProbs.csv", "LogisticRegressionFullModelProbs.csv", "LogisticRegressionSigModelProbs.csv", "NeuralNetworkSigModelCVProbs.csv", "NeuralNetworkSigModelProbs.csv", "QDAModelProbs.csv", "QDASigModelProbs.csv", "RandomForestModelProbs.csv", "RandomForestSigModelProbs.csv", "RandomForestTrimmedModelProbs.csv", "RidgeRegressionFullModelProbs.csv", "RidgeRegressionSigModelProbs.csv", "SVMModelProbs.csv", "TreeModelProbs.csv", "TreeSigModelProbs.csv", "WCModelProbs.csv" );

trainModels = data.frame(PassengerId=1:nrow(newTrain), Survived=newTrain$Survived );
testModels = data.frame(PassengerId=1:nrow(newTest));

# read all the train model predictions and store in dataframe
numModels = length( train_models );
for ( i in 1:numModels ){
  trainFileName = train_models[i];
  fp = fileparts( trainFileName );
  modelName = gsub( "TrainProbs", "", fp$name );
  trainModelFrame = read.csv( trainFileName );
  trainModels[ modelName ] = trainModelFrame$Survived;
  # now for test
  testFileName = test_models[i];
  fp = fileparts( testFileName );
  modelName = gsub( "Probs", "", fp$name );
  testModelFrame = read.csv( testFileName );
  testModels[ modelName ] = testModelFrame$Survived;
}

allTrainModels = data.frame(PassengerId=1:nrow(newTrain), Survived=newTrain$Survived );
allTestModels = data.frame(PassengerId=1:nrow(newTest));

numAllModels = length( all_train_models );
for ( i in 1:numAllModels ){
  trainFileName = all_train_models[i];
  fp = fileparts( trainFileName );
  modelName = gsub( "TrainProbs", "", fp$name );
  trainModelFrame = read.csv( trainFileName );
  allTrainModels[ modelName ] = trainModelFrame$Survived;
  # now for test
  testFileName = all_test_models[i];
  fp = fileparts( testFileName );
  modelName = gsub( "Probs", "", fp$name );
  testModelFrame = read.csv( testFileName );
  allTestModels[ modelName ] = testModelFrame$Survived;
}

all_train_model_names = setdiff( names(allTrainModels), c("PassengerId", "Survived") );

# this calculates all correlations between models.  Some corr calcs break because
# they're all one value so standard deviation calc blows up.
modelCors <- trainModels[,!names(trainModels) %in% c("PassengerId","Survived")] %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'Model_A') %>%
  gather(Model_B, correlation, -Model_A)

corrThresh = 0.75;
modelNames = unique(modelCors$Model_A);

# this recursive function is intended to output groups of models that 
# have inter correlations below threshold
addToModelSet = function( modelSet, corrs, model, corrThresh ){
    #modelSet[ length(modelSet)+1 ] = model;
    if ( ! model %in% modelSet ){
        modelSet = union( modelSet, model );
    }
    idxUsed = c();
    idxUnused = 1:nrow(corrs);
   
    toUse = c();
    for ( idx in idxUnused ){
        thisModel = corrs$Model_A[idx];
        if ( thisModel %in% modelSet ){
            toUse[ length(toUse) + 1] = idx;
        }
    }
   
    for ( idx in toUse ){
        addModel = TRUE;
        modelA = corrs$Model_A[idx];
        modelB = corrs$Model_B[idx];
        correlation = corrs$correlation[idx];
        if ( correlation < corrThresh ){
            # now, check corr with every other member of modelSet
            for ( model in modelSet ){
                # want to check modelB corr with all models in modelSet
              idxModelA = which( corrs$Model_A == model );
              idxModelB = which( corrs$Model_B == modelB );
                if ( model == modelB ){
                    idxUsed[ length(idxUsed) + 1 ] = idx;
                    # correlations of the same model
                    next;
                }
              rightRowIdx = intersect( idxModelA, idxModelB );
              if( length( rightRowIdx ) == 1 && rightRowIdx < nrow(corrs) ){
                rightRow = corrs[ rightRowIdx, ];
              }
              else{
                next;
              }
                rightCorr = rightRow$correlation;
                if ( rightCorr > corrThresh ){
                    idxUsed[ length(idxUsed) + 1 ] = idx;
                    addModel = FALSE;
                    break;
                }
            }
            idxUsed[ length(idxUsed) + 1 ] = idx;
        }
        else{
          addModel = FALSE;
        }
        if ( addModel ){
          idxToUse = setdiff( idxUnused, idxUsed );
          newCorrs = corrs[idxToUse,]
          #newCorrs = corrs;
          modelSet = addToModelSet( modelSet, newCorrs, modelB, corrThresh);
        }
    }
    
    return( modelSet );
}

modelSets = list();
print( "Model sets that have inter-correlations less than threshold:")
## [1] "Model sets that have inter-correlations less than threshold:"
for ( model in modelNames ){
  modelSet = addToModelSet( c(), modelCors, model, corrThresh );
  cat( paste(paste(modelSet, collapse=", "),"\n"));
  modelSets[ length(modelSets) + 1 ] = list( modelSet );
}
## BoostedTreeModel, GenderModel 
## BoostedTreeSigModel, GenderModel 
## GenderModel, BoostedTreeModel 
## LassoRegressionFullModel 
## LassoRegressionSigModel 
## LDAModel, NeuralNetworkSigModel 
## LDAReducedModel, NeuralNetworkSigModel 
## LogisticRegressionFullModel 
## LogisticRegressionSigModel 
## NeuralNetworkSigModelCV, GenderModel, QDASigModel 
## NeuralNetworkSigModel, GenderModel, QDAModel 
## QDAModel, GenderModel, NeuralNetworkSigModel 
## QDASigModel, GenderModel, NeuralNetworkSigModelCV 
## RandomForestModel, GenderModel, QDASigModel 
## RandomForestSigModel 
## RandomForestTrimmedModel 
## RidgeRegressionSigModel 
## SVMModel, NeuralNetworkSigModel 
## TreeModel 
## TreeSigModel 
## WCModel, NeuralNetworkSigModel, QDAModel
# now go through each model and find the other models which are correlated below
# threshold (0.75)
# manually choose the set #14
uncorrelatedModelSet = modelSets[[14]];

# now for all models that have passed criteria, create a model
selectedTrainModels = trainModels[,c("Survived", uncorrelatedModelSet)];
ensemble_lr = glm( Survived~., data=selectedTrainModels, family=binomial);
summary( ensemble_lr );
## 
## Call:
## glm(formula = Survived ~ ., family = binomial, data = selectedTrainModels)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2024  -0.5514  -0.5514   0.4304   1.9793  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.8411     0.1749 -10.524  < 2e-16 ***
## RandomForestModel   3.3197     0.3210  10.341  < 2e-16 ***
## GenderModel         0.1817     0.5384   0.337  0.73578    
## QDASigModel         0.7193     0.2544   2.828  0.00469 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  712.48  on 887  degrees of freedom
## AIC: 720.48
## 
## Number of Fisher Scoring iterations: 4
trainSetProbs = predict( ensemble_lr, trainModels, type="response");  # now a vector of probabilities
# probabilities >= 0.5 mean survived, < 0.5 mean perished
threshold = 0.5;
selectedTrainPredictions = as.numeric( trainSetProbs >= threshold );
confusionMatrix( selectedTrainPredictions, trainModels$Survived );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 524 102
##          1  25 240
##                                           
##                Accuracy : 0.8575          
##                  95% CI : (0.8328, 0.8798)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6853          
##  Mcnemar's Test P-Value : 1.542e-11       
##                                           
##             Sensitivity : 0.9545          
##             Specificity : 0.7018          
##          Pos Pred Value : 0.8371          
##          Neg Pred Value : 0.9057          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5881          
##    Detection Prevalence : 0.7026          
##       Balanced Accuracy : 0.8281          
##                                           
##        'Positive' Class : 0               
## 
# check to see if threshold 0.5 is appropriate
rocr.pred = prediction( trainSetProbs, trainModels$Survived);
acc.perf = performance( rocr.pred, measure="acc");
plot( acc.perf, main="Manual Ensemble Logistic Regression Model Training Set Accuracy by Threshold" );

# threshold 0.5 is appropriate

# now, how'd we do against the test set?
testSetProbs = predict( ensemble_lr, testModels, type="response");
testPredictions = as.numeric( testSetProbs >= threshold );
ensembleModelManual = data.frame(PassengerId=newTest$PassengerId, Survived=testPredictions);
write.csv( ensembleModelManual, "EnsembleModelManual.csv", row.names=FALSE, quote=FALSE );

# now try just an ensemble model, taking means
ensembleTrainModels = allTrainModels[,all_train_model_names];
# simply take the means across the rows to get "probabilities" of survival for each passenger
ensembleTrainPredictions = rowMeans( ensembleTrainModels );
ensembleTrainCalls = round( ensembleTrainPredictions );
confusionMatrix( ensembleTrainCalls, allTrainModels$Survived );
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 509  79
##          1  40 263
##                                           
##                Accuracy : 0.8664          
##                  95% CI : (0.8423, 0.8881)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7114          
##  Mcnemar's Test P-Value : 0.000495        
##                                           
##             Sensitivity : 0.9271          
##             Specificity : 0.7690          
##          Pos Pred Value : 0.8656          
##          Neg Pred Value : 0.8680          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5713          
##    Detection Prevalence : 0.6599          
##       Balanced Accuracy : 0.8481          
##                                           
##        'Positive' Class : 0               
## 
# check to see if threshold 0.5 is appropriate
rocr.pred = prediction( ensembleTrainPredictions, allTrainModels$Survived);
acc.perf = performance( rocr.pred, measure="acc");
plot( acc.perf, main="All Means Ensemble Logistic Regression Model Training Set Accuracy by Threshold" );

# threshold 0.5 is appropriate

# How'd the ensemble model do against the test set?
ensembleTestModels = testModels[,modelNames];
# simply take the means across the rows to get "probabilities" of survival for each passenger
ensembleTestPredictions = rowMeans( ensembleTestModels );
ensembleTestCalls = round( ensembleTestPredictions );

#fullLogisticRegressionModelTrain = data.frame(PassengerId=train$PassengerId, Survived=predictions);
#fullLogisticRegressionModel = data.frame(PassengerId=test$PassengerId, Survived=fullLogisticPredictions);
# now, create a new test and train data frame with these models as predictors

# write to  file
ensembleModelMeans = data.frame(PassengerId=newTest$PassengerId, Survived=ensembleTestCalls);
write.csv( ensembleModelMeans, "EnsembleModelMeans.csv", row.names=FALSE, quote=FALSE );

The manually selected linear regression ensemble model was constructed using three predictors: Random Forest, Gender model, and QDA. The training set score was 0.8575 and test set score was 0.7847. The simple row means model scored a 0.8664 against the training set and a 0.7703 against the test set. As would be expected, the row means model, which used some very overfit models, was itself overfit - scoring high against the training set and low against the test set - lower than the manually selected ensemble model.

Cheat Models

Cheaters never prosper. Nevertheless.

Google Data-Mining

All of the Titanic data is available online. It is possible to find resource pages listing who died and who survived. Of course, much of it is not in a standardized format and has to be screenscraped. In addition, there are errors and discrepancies both in the Kaggle data set and in the data online. Word for word searches fail in many cases.

library(RCurl);
library(XML);
library(readr);
#testIDs = 892:nrow(data_combined);
#userAgents = c( "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0", "Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36", "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1", "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)");
userAgents = c( "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063" );
numUserAgents = length( userAgents );
trainIDs = 1:891;
testIDs = 892:nrow(data_combined);
setIDs = 1:nrow(data_combined);
trainURLS = c();
setHits = data_combined[,c("PassengerId","Survived")];
setHits$SurvivorHits = 0;
setHits$VictimHits = 0;
setHits$SurvivorFirst = 0;
setHits$SurvivedPrediction = 0;
for ( pass_idx in 1:length(setIDs) ){
  i = setIDs[pass_idx];
  name = gsub( " ", "+", as.character( data_combined[i,"NameString"] ) ); # remove whitespace
  sex = as.character( data_combined[i,"Sex"] );
  searchURL = paste0( "https://www.google.com/search?q=", name, "+", sex, "+titanic" );
  trainURLS[i] = searchURL;
  randomUserAgent = userAgents[ round( runif(1, 1, numUserAgents) ) ];
  wd = getwd();
  localFile = paste0( wd, "/html/", i, ".html" );
  if ( file.exists( localFile )){
    resultsHTML = read_file( localFile );
  }else{
    resultsHTML = getURL( searchURL, httpheader=c('User-Agent'=randomUserAgent) );
    write( resultsHTML, localFile );
    Sys.sleep( runif( 1, 27, 67)); # randomly sleep between 27 and 67 seconds
    # this is necessary because if Google thinks you're mining their data with a bot
    # you will be grounded and sent to your room for a timeout.
  }
  # now search for "victim" and "survivor"
  victimHits = gregexpr("victim", resultsHTML, ignore.case=TRUE);
  victimHitIndices = which( victimHits[[1]] > -1 );
  numVictimHits = length( victimHitIndices );
  setHits[pass_idx,"VictimHits"] = numVictimHits;
  firstVictimHit = victimHits[[1]][1];
  survivorHits = gregexpr("survivor", resultsHTML, ignore.case=TRUE);
  survivorHitIndices = which( survivorHits[[1]] > -1 );
  numSurvivorHits = length( survivorHitIndices );
  setHits[pass_idx,"SurvivorHits"] = numSurvivorHits;
  firstSurvivorHit = survivorHits[[1]][1];
  #cat( "row ", i, "\n")
  #assert_that( ( numSurvivorHits > 0 ) || (numVictimHits > 0) );
  if (!( ( numSurvivorHits > 0 ) || (numVictimHits > 0) ) ){
    cat( "Could not find 'victim' or 'survivor' keyword for PassengerID ", i, ".  Manually investigate and update .html file with 'survived' or 'victim'.\n")
  }
  survived = FALSE;
  if ( numSurvivorHits > numVictimHits ){
      survived = TRUE;
  }
  else if ( ( numSurvivorHits > 0 ) && ( numVictimHits > 0 ) ){
    # both are positive, take the min
    if ( firstSurvivorHit < firstVictimHit ){
      survived = TRUE;
    }
    else{
      survived = FALSE;
    }
    setHits[pass_idx,"SurvivorFirst"] = as.numeric( survived );
  }
  survived_flag = 0;
  if( survived ){
    survived_flag = 1;
  }
  setHits[pass_idx,"SurvivedPrediction"] = survived_flag;
  if ( !is.na( data_combined[i,"Survived"] ) && survived_flag != data_combined[i,"Survived"] ){
    #cat( "Prediction does not match truth at ", i, as.character( data_combined[i,"Name"] ), "\n");
  }
}
# evaluate training set performance 
trainingPredictions = setHits$SurvivedPrediction[1:891];
confusionMatrix( trainingPredictions, data_combined$Survived[1:891])

cheatTestPredictions = setHits$SurvivedPrediction[892:nrow(data_combined)];

write.csv( setHits, file="CheatModel1.csv", quote=FALSE);

Screen Scraping Titanic Resource Webpage

A webpage exists that lists all the survivors in bold and all the passengers who died in normal font. It is possible to build a scraper to parse the passengers into two lists and then try to match the names in the Kaggle test set with the names in the Survived and Died sets derived from the webpage. The idea is to use both a bag-of-words model and a distance model to capture both number of token hits and handle any differences in spelling.

# read in CheatModel1
cheat <- read.csv("CheatModel1.csv", header = TRUE);
cheat$SurvivedFactor = as.factor( cheat$Survived );
cheat$SurvivedTokenMatches = 0;
cheat$DiedTokenMatches = 0;
cheat$SurvivedTokenDistance = 0;
cheat$DiedTokenDistance = 0;
resourceURL = "http://www.titanic-whitestarships.com/1st_Class_Pass.htm";
# Example HTML:
#<strong>Gordon, Sir Cosmo Duff<br>
#Gordon, Lady Lucile Duff<br>
#and Maid (Miss Laura Mabel Francatelli)<br>
#Gracie, Colonel Archibald IV</strong><br>
#Graham, Mr. George Edward<br>
#Graham1, Mr. George Edward<br>
#Graham2, Mr. George Edward<br>
#Graham3, Mr. George Edward<br>
#<strong>Graham, Mrs. William Thompson<br>
#(nee Edith Junkins)<br>
#Graham, Miss Margaret<br>
#Greenfield, Mrs. Leo David<br>
#(nee Blanche Strouse)<br>
#Greenfield, Mr. William Bertram</strong><br>
#Guggenheim, Mr. Benjamin<br>
#and Manservant (Victor Giglio)</font></p>
cheatSurvived = numeric(nrow(newTest));
wd = getwd();
localFile = file.path( wd, "TitanicResource.htm" );  
# know the file is approx 2100 lines.  Initialize each died/survived list to 3000
died = character(3000);
survived = character(3000);
if ( !file.exists( localFile )){  # Do this so we can load the URL once and not spam it everytime we run this script
  resultsHTML = getURL( resourceURL );
  write( resultsHTML, localFile );
}
# now, parse the file, line-by-line.  Survivors are in bold, denoted by the <strong> tag:
fid = file( localFile, "r");
isStrong = FALSE;
diedCtr = 1;
survivedCtr = 1;
while( TRUE ){
  line = readLines(fid,n=1);
  if( length(line) == 0 ){
    break;
  }
  # look for </strong>
  thisLineStartStrong = grepl( "^<strong>", line );
  thisLineEndStrong = grepl( "</strong>", line );
  # if line contains <strong
  # strip html tags
  rawText = gsub( "<.*?>", "", line );
  # skip if no comma
  thisLineHasComma = grepl( ",", rawText );
  if ( !thisLineHasComma ){
    next;
  }
  # remove punctuation
  rawText = gsub( "[[:punct:]]", "", rawText );
  # skip if line (name) is too long
  if ( nchar( rawText) > 100 ){
    next;
  }
  if( isStrong || thisLineStartStrong ){
    survived[survivedCtr] = rawText;
    survivedCtr = survivedCtr + 1;
  }
  else{
    died[diedCtr] = rawText;
    diedCtr = diedCtr + 1;
  }
  isStrong = thisLineStartStrong && !thisLineEndStrong;
}
close(fid);
# truncate the died/survived lists
survived = sort( survived[1:survivedCtr-1] );
died = sort( died[1:diedCtr-1] );  # yes, this includes crap lines at the top of list
#

# now have two lists and need to do bag of words searching vs. each name
# for each name, tokenize, then query the died/survived lists
for( i in 1:nrow( newTest ) ){
  thisName = newTest[i,"Name"];
  # remove non-word characters 
  thisName = gsub( "[[:punct:]]", "", thisName );
  # now tokenize each name and search each died/survived for distance from each token
  nameTokens = strsplit( thisName, "\\s+")[[1]];
  numTokens = length( nameTokens );
  # search through survived
  survivedTokenMatches = numeric(length(survived));
  survivedTokenDistance = numeric(length(survived));
  for ( j in 1:length( survived ) ){
    thisScore = 0;
    thisSurvived = survived[j];
    survivedTokens = strsplit( thisSurvived, "\\s+")[[1]];
    numSurvivedTokens = length( survivedTokens );
    # now, take each name token and search for match in survived tokens
    for ( k in 1:numTokens){
      thisNameToken = nameTokens[k];
      if ( thisNameToken %in% survivedTokens ){
        thisScore = thisScore + 1;
      }
    }
    survivedTokenMatches[j] = thisScore;
    # go through each name token and find the min distance to survived tokens
    totalDist = 0;
    for ( k in 1:numTokens){
      thisNameToken = nameTokens[k];
      wordDist = 10;
      distances = numeric(numSurvivedTokens);
      for ( m in 1:numSurvivedTokens){
        thisSurvivedToken = survivedTokens[m];
        distances[m] = adist( thisNameToken, thisSurvivedToken );
      }
      totalDist = totalDist + min( distances );
    }
    survivedTokenDistance[j] = totalDist;
  }
  bestSurvivedScore = max( survivedTokenMatches );
  bestSurvivedDistance = min( survivedTokenDistance );
  cheat[i,"SurvivedTokenHits"] = bestSurvivedScore;
  cheat[i,"SurvivedTokenDistance"] = bestSurvivedDistance;
  
  # search through died
  diedTokenMatches = numeric(length(died));
  diedTokenDistance = numeric(length(died));
  for ( j in 1:length( died ) ){
    thisScore = 0;
    thisDied = died[j];
    diedTokens = strsplit( thisDied, "\\s+")[[1]];
    numDiedTokens = length( diedTokens );
    # now, take each name token and search for match in survived tokens
    for ( k in 1:numTokens){
      thisNameToken = nameTokens[k];
      if ( thisNameToken %in% diedTokens ){
        thisScore = thisScore + 1;
      }
    }
    diedTokenMatches[j] = thisScore;
    # go through each name token and find the min distance to survived tokens
    totalDist = 0;
    for ( k in 1:numTokens){
      thisNameToken = nameTokens[k];
      wordDist = 10;
      distances = numeric(numDiedTokens);
      for ( m in 1:numDiedTokens){
        thisDiedToken = diedTokens[m];
        distances[m] = adist( thisNameToken, thisDiedToken );
      }
      totalDist = totalDist + min( distances );
    }
    diedTokenDistance[j] = totalDist;
  }
  
  bestDiedScore = max( diedTokenMatches );
  bestDiedDistance = min( diedTokenDistance );
  cheat[i,"DiedTokenHits"] = bestSurvivedScore;
  cheat[i,"DiedTokenDistance"] = bestSurvivedDistance;
  #cat( thisName, " bestScore = ", bestScore, ", bestMatch = ", survived[bestIdx], "\n");
}

# now that we have some cheating scores, build a simple tree model
library(tree);
cheat_train = cheat[1:891,];  # use the training data
cheat_test= cheat[892:nrow(cheat),];  # use the test data
tree_cheat_model = tree( SurvivedFactor~SurvivorHits+VictimHits+SurvivorFirst+SurvivedTokenHits+SurvivedTokenDistance+DiedTokenHits+DiedTokenDistance,data=cheat_train );
summary( tree_cheat_model )
plot( tree_cheat_model )
text( tree_cheat_model, pretty=0);
cheat_train_pred = predict( tree_cheat_model, cheat_train, type="class");
cheat_test_pred = predict( tree_cheat_model, cheat_test, type="class");
# performance
confusionMatrix(cheat_train_pred, cheat_train$Survived);

cheatModelTest = data.frame(PassengerId=newTest$PassengerId, Survived=as.numeric(cheat_test_pred)-1);
write.csv( cheatModelTest, "CheatModelFull.csv", row.names=FALSE, quote=FALSE );

Summary

Here is the summary of training and test results for the different models:

Model CVMethod TrainingScore TestScore
BagModel - 0.9876543 0.7320574
BagSigModel - 0.9854097 0.7272727
NeuralNetworkFullModel - 0.9203143 0.6746411
NeuralNetworkSigModel - 0.9124579 0.7105263
BoostedTreeModel 8-fold CV 0.8933782 0.7559809
BoostedTreeSigModel 8-fold CV 0.8922559 0.7368421
KNNBest8Model LOOCV 0.8787879 0.7559809
KNNBest6Model LOOCV 0.8754209 0.7200957
KNNBest7Model LOOCV 0.8754209 0.7344498
KNNBest9Model LOOCV 0.8742985 0.7631579
BagSigModel - 0.8664000 0.7272727
RandomForestSigModel - 0.8608305 0.7822967
BagModel - 0.8575000 0.7320574
RandomForestModel - 0.8574635 0.7846890
KNNFullModel LOOCV 0.8563412 0.7607656
TreeSigModel Pruning 0.8540965 0.7607656
NeuralNetworkSigModelCV 10-fold CV 0.8529742 0.7751196
TreeModel Pruning 0.8507295 0.7727273
RandomForestTrimmedModel - 0.8484848 0.7775120
NeuralNetworkFullModelCV 10-fold CV 0.8417508 0.7846890
LogisticRegressionSigModel - 0.8395062 0.7607656
SVMModel 10-fold CV 0.8395062 0.7679426
LassoRegressionSigModel 10-fold CV 0.8372615 0.7775120
LDAModel - 0.8372615 0.7727273
LassoRegressionFullModel 10-fold CV 0.8361392 0.7703349
LDAReducedModel - 0.8338945 0.7727273
LogisticRegressionFullModel - 0.8338945 0.7607656
RidgeRegressionSigModel 10-fold CV 0.8249158 0.7870813
RidgeRegressionFullModel 10-fold CV 0.8226712 0.7846890
QDAModel - 0.8159371 0.7296651
WomenAndChildrenFirstModel - 0.7901235 0.7511962
gender_submission - 0.7867565 0.7631579
QDASigModel - 0.7845118 0.7272727
NullModel - 0.6161616 0.6244019

Now then, how does one go about picking a model out of all of these? Clearly, the training scores of some of the more complex models cannot be trusted as they’re obviously overfit. The bagged tree model has training accuracies of about 0.98! But how would we know that they’re overfit without peeking at the test scores? The boosted tree cross-validation accuracies were about 0.89 - but these also proved to be vastly overfit.

But suppose all of these training results were more or less equal - some at 0.84 cross-validated training set accuracy, some at 0.83 cross-validated training set accuracy, and one at 0.85 cross-validated training set accuracy. What then would be the criterion for choosing a model? At first blush, it might be obvious to choose the one that that has the highest cross-validated training set accuracy of 0.85. But what if that model is very complex with a lot of parameterization and there is a much simpler model that scored at 0.84. Then would it be better to choose the simpler, but lower-scoring model? Is the difference between 0.84 and 0.85 statistically significant or is the difference just due to random chance? Is there a way to quantify the statistical significance? I’ve seen McNemar’s test but it looks like it operates on the correct counts between only two different models, not when num_models > 2. And if we run McNemar’s test over all pairs of models, then we’re likely to see something “significant” just through chance and need to somehow correct for it.

If I had to pick the “best” model out of these, it would probably be the ridge regression model. The squared regularization penalty helps to drive down the coefficients of some of the predictors better than the similar lasso regression model.

Lessons Learned

This report is a little stream-of-consciousness as it is/was my first Kaggle competition and kernel. But what a great introduction to overfitting with R. The EDA portion was really enjoyable as was the feature engineering section. I thought that time put in to the early part would pay off later on but in retrospect, I probably should have tried to regularize the data a little more with outlier detection and deletion. But I was reluctant to delete any data that might help me eek out a few more percentage points!

Also, in retrospect, it is much easier to run all of the models through the caret framework. In this model, a single feature set can be run through different models fairly seemlessly - and then even run as ensembles at a later step.

Thanks for reading along!