Titanic dataset
Submission date: 8/1/2013
Dated: 29/12/2012
The dataset corresponds to the sinking of the Titanic on 15 April 1912. It is part of a database containing the passengers and crew who were aboard the ship, and various attributes relating to them. The purpose of this task is to apply the CRISP-DM methodology and follow the phases and tasks of this model. Using classification in RapidMiner with both the decision tree and k-NN algorithms, I will create a training model and try to predict the class: survived or didn't survive.
If I apply a decision tree to the dataset as it is, I get a prediction rate of 78%. I will try various techniques throughout this report to increase the overall prediction rate.
Data mining objectives: I would like to explore the preconceived ideas I have about the sinking of the Titanic and test whether they are correct. Was there a majority of 3rd class passengers among those who died? What was the ratio of passengers who died, male or female? Did the location of cabins make a difference as to who survived? Did chivalry prevail, and did "women and children first" actually happen?
Data Understanding:
Describe the data:
Class label: Survived (1 or 0). 1 = survived, 0 = died. Type: binomial. Total: 891. Survived: 342, died: 549.
Attributes: 10 attributes, 891 rows.
The dataset has primarily categorical attributes, so there is low information content. This might indicate that a decision tree would be an appropriate model to use. I can see that the number of rows in the dataset is indeed 10 to 20 times the number of columns, so the number of instances is adequate. There don't seem to be any inconsistencies in the data.
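The class counts above also give a useful reference point: a model that always predicts the larger class. The following is a minimal sketch of that arithmetic, not part of the RapidMiner process; the variable names are my own and only the two counts come from the data description.

```python
# Class counts taken from the data description above.
survived, died = 342, 549
total = survived + died

# Accuracy of a "model" that always predicts the majority class (died).
baseline_accuracy = max(survived, died) / total

print(f"total rows: {total}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any prediction rate reported for a real model, such as the 78% from the untuned decision tree, is best read against this baseline rather than against 50%.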
Pclass: 1st, 2nd, or 3rd class. Type: polynomial (categorical). 3rd class: 491, 2nd class: 216, 1st class: 184. 0 missing.
Name: name of the passenger.
Sex: male, female. Type: binomial. Male: 577, female: 314. 0 missing.
Age: from 0.42 to 80. Average age: 29, standard deviation about 14. Max was 80. 177 missing.
SibSp (siblings on board): Type: integer. Average less than 1, highest 8. This suggested an outlier, but on inspection the names of the passengers with 8 siblings matched. (The name was Sage, 3rd class passengers, all died.) 0 missing.
Parch: number of parents or children on board. Type: integer. Average: 0.3, deviation 0.8. Max was 6. 0 missing.
Ticket: ticket number. Type: polynomial. To me these ticket numbers seem quite random and my first inclination is to discard them. 0 missing.
Fare: cost of ticket. Type: real. Average: 32, deviation ±49. Maximum 512. There seems to be quite a disparity in the range of values here. Three tickets cost 512; are they outliers? 0 missing.
Cabin: cabin numbers. Type: polynomial. 687 missing.
From looking at this data I think I can discount one of my initial questions, the one about cabin numbers.
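To make the missing-value counts above easier to compare, they can be expressed as fractions of the 891 rows. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the 40% cut-off is the rule of thumb applied in this report rather than a fixed standard.

```python
TOTAL_ROWS = 891

# Missing-value counts as listed in the attribute summary above.
missing_counts = {"Cabin": 687, "Age": 177, "Fare": 0, "Ticket": 0}

# Cut-off used in this report for deleting an attribute outright.
DROP_THRESHOLD = 0.40


def missing_fraction(count, total=TOTAL_ROWS):
    """Return the share of rows in which the attribute is missing."""
    return count / total


for name, count in missing_counts.items():
    fraction = missing_fraction(count)
    action = "drop attribute" if fraction > DROP_THRESHOLD else "keep"
    print(f"{name}: {fraction:.1%} missing -> {action}")
```

On these counts, Cabin is missing in roughly three quarters of the rows, while Age is missing in about a fifth, which is why the two attributes are treated differently.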
If there were more data, cabin location might be an interesting factor in survival. As it stands the quality of the data is not good: there are just too many missing entries, i.e. greater than 40% (687 of 891 is about 77%). So I will delete (filter out) the cabin attribute from the dataset. The age attribute could also cause a problem because of the number of missing fields. There are too many rows affected to delete them. I might use the average of all ages to fill in the blanks.
Explore the data: From an initial exploration of the data, I was able to look at various plots and found some interesting results.
I have tried to keep my findings to the initial questions that I wanted answered. Was there a majority of 3rd class passengers among those who died? Figure 2 shows that this was true. This graph just shows survival by class, with 3rd class faring the worst. The same is shown with a scatter plot, but with the added attribute sex. On the female side of the first class passengers, only a few died. Interestingly, it shows that it was mostly male 3rd class passengers who perished, and that more males than females died. There is a clear division between the classes.
This graph answers my other question: what was the ratio of passengers who died, male or female? From this we can see that it was mainly males who did not survive. There were more males on board (577), and about 460 of them perished. Of the females (314), about 235 survived. Another attribute that needs attention is age. I wanted to find out whether the "women and children first" policy was adhered to, but there are 177 missing age values, which complicates my results on this. Leaving the 177 as they are, I get the graph in Figure 5, but it is not conclusive.
I thought that the fare might indicate a children's price and therefore allow me to fill in an age, but the fare doesn't seem to follow much of a pattern. Another idea that might help would be to look at the names of passengers, i.e. "Miss" might signify a lower age. (In 1912 the average age of marriage was 22, so anyone with the title Miss could have an age of less than 22.) Names which include "Master" might indicate a young age as well. Figure 5 also indicates possible outliers on the right-hand side. From this graph I could easily see the breakdown of the different classes of passenger and where they embarked.
It is obvious that Southampton had the largest number of passengers boarding. Queenstown had the highest proportion of 3rd class passengers compared to 2nd and 1st class at that port, and it's also interesting to note that this was an Irish port. This graph further explores the port of embarkation and shows the survival rate from each, as well as the different classes. To me it seems that the majority of the 3rd class passengers who were lost came from Southampton, although it did have the highest number of 3rd class passengers. A closer look at Southampton port follows.
The majority who didn't survive were 3rd class (blue). Also notable is the handful of 1st class passengers (green) who died, even though Southampton had the highest number of 1st class passengers boarding. See Figure 6.
Verify data quality: There were a number of missing values in the dataset. The largest amount of missing data came from the cabin attribute. As it is higher than 45% (687 missing), I decided to filter out this column. There are also 177 missing values in the age attribute. This amount of missing data is again too large a percentage to ignore and needs to be filled in.
I can see that the dataset contains fewer than 1,000 rows, so I think that sampling will not have to be performed. There don't seem to be any inconsistencies in the data. There are still 2 missing values in the embarked attribute. I see that both are 1st class passengers, so from my graph on embarkation I think I can set one passenger's port of embarkation to Cherbourg. The other passenger is a George Nelson, whom I will assign to Southampton. I decided to filter out names also, as I don't see how they can help in the dataset.
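Before the name attribute is discarded, the title idea raised earlier (Miss or Master suggesting a younger passenger) could be tried as an alternative way of filling in ages. The sketch below is not what was done in this report. It assumes names are written as "Surname, Title. Given names", and the rows, names and ages in it are invented purely for illustration.

```python
import re
from statistics import mean

# Assumed name layout: "Surname, Title. Given names".
TITLE_RE = re.compile(r",\s*([^.,]+)\.")


def extract_title(name):
    """Return the title found after the comma, or None if there is none."""
    match = TITLE_RE.search(name)
    return match.group(1).strip() if match else None


def fill_age_by_title(rows):
    """Replace missing ages with the mean known age for the same title."""
    ages_by_title = {}
    for row in rows:
        if row["age"] is not None:
            title = extract_title(row["name"])
            ages_by_title.setdefault(title, []).append(row["age"])

    filled = []
    for row in rows:
        age = row["age"]
        if age is None:
            known = ages_by_title.get(extract_title(row["name"]))
            age = mean(known) if known else None
        filled.append({**row, "age": age})
    return filled


# Invented example rows, not taken from the Titanic dataset.
rows = [
    {"name": "Doe, Mr. John", "age": 40.0},
    {"name": "Doe, Miss. Jane", "age": 18.0},
    {"name": "Roe, Miss. Ann", "age": None},
    {"name": "Roe, Master. Tom", "age": 4.0},
    {"name": "Poe, Master. Sam", "age": None},
    {"name": "Poe, Miss. Eve", "age": 22.0},
]

for row in fill_age_by_title(rows):
    print(row["name"], row["age"])
```

Compared with a single overall average, a per-title average spreads the filled-in values over several ages instead of concentrating them all on one value.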
The name attribute may have helped with age, by looking at the title as I said, but here I just used the average age to replace the missing values. Another approach to filling in the missing age fields might be linear regression.
Remove possible outliers? I can see that there may be some outliers. For instance, in the fare attribute there are three tickets which cost 512 when the average is 32. They were first class tickets, but the difference is huge.
Data Preparation: Here is the result of using X-Validation on the dataset before any data preparation has taken place.
I will now sort out the problem of the 687 missing cabin numbers. With the proportion being higher than 40%, I've decided to delete the attribute entirely. I've also deleted the name attribute, as I don't see how it will help. By deleting cabin, name and ticket, here is the result I get: I replaced the missing age fields with the average age; this increased the accuracy slightly and gave these results with X-Validation: I used Detect Outlier, picked the top ten and then filtered them out. This gave this result: The class recall for survived has not improved much.
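Since class recall is the figure being tracked here, it may help to spell out how accuracy, class precision and class recall are read off a two-class confusion matrix. This is a minimal sketch with a hypothetical function name; the counts passed in are made up for illustration and are not results from this report.

```python
def class_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall for the positive class (survived = 1).

    tp: survivors predicted as survived
    fp: non-survivors predicted as survived
    fn: survivors predicted as died
    tn: non-survivors predicted as died
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # how many predicted survivors really survived
    recall = tp / (tp + fn)     # how many real survivors were found
    return accuracy, precision, recall


# Hypothetical counts, NOT taken from the report's confusion matrices.
accuracy, precision, recall = class_metrics(tp=70, fp=20, fn=30, tn=80)
print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%}")
```

A model can gain overall accuracy by predicting the larger class (died) more often, which is one reason the recall for survived can lag behind the headline accuracy.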
Increasing the number of neighbours in the Detect Outlier operator improved things, and limiting the filter to deleting 5 rows also gave a better accuracy. I decided to use specified binning for the ages and broke them into three bins: children aged up to 13, middle aged from 13 to 45, and older from 45 to 80. I tried different age ranges and found that these ranges yielded the best results. It did increase the accuracy. I also used binning for the fares, splitting them into low, mid and high, which also improved the results in the confusion matrix.
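The three age bins described above can be written down as a small function. This is only a sketch of the idea, not the RapidMiner operator itself: the labels are my own, and the choice to place an age that falls exactly on a boundary (13 or 45) in the lower bin is an assumption. The fare cut-offs are not given in this report, so fares are not binned here.

```python
from bisect import bisect_left

# Upper edges of the three age bins described above.
AGE_EDGES = [13, 45, 80]
AGE_LABELS = ["child", "middle", "older"]


def age_bin(age):
    """Map an age to its bin label; boundary ages go to the lower bin."""
    if age is None:
        return None  # missing ages are left for a separate imputation step
    if not 0 <= age <= AGE_EDGES[-1]:
        raise ValueError(f"age {age} is outside the range 0 to {AGE_EDGES[-1]}")
    return AGE_LABELS[bisect_left(AGE_EDGES, age)]


for example_age in (0.42, 13, 29.7, 45, 62, 80):
    print(example_age, "->", age_bin(example_age))
```

Keeping the edges in one list makes it easy to try other ranges, which is how the ranges in this report were chosen.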
I used Detect Outlier to find the ten most obvious outliers, and then used a filter to get rid of them. I have decided to remove cabin from the dataset. There are also 177 missing age values, for which I have tried various approaches. I changed the missing ages to the average age, but this gives a spike in the number of passengers aged 29.7. Example of the average age problem:
Modeling: I tried to implement both the decision tree and k-NN algorithms, seeing as the dataset is primarily categorical. I found that k-NN yielded the best results regarding accuracy. It was first set at k = 1. The accuracy was not great at 73%.
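To show what the k parameter controls, here is a minimal k-NN sketch. It is not the RapidMiner operator: it assumes purely numeric features and Euclidean distance, and the training points and query are invented so that one mislabelled point sits inside the other class's region.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Predict a label by majority vote among the k nearest training rows.

    train: list of (features, label) pairs, features being numeric tuples.
    """
    neighbours = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Invented toy data: label 0 clusters near (1, 1), label 1 near (3, 3).
train = [
    ((1.0, 1.0), 0), ((1.2, 0.9), 0), ((0.9, 1.3), 0),
    ((3.0, 3.0), 1), ((3.2, 2.9), 1), ((2.8, 3.1), 1),
    ((1.5, 1.5), 1),  # a noisy point: label 1 inside the label-0 region
]
query = (1.6, 1.6)

for k in (1, 3, 5):
    print(f"k={k}: predicted label {knn_predict(train, query, k)}")
```

With k = 1 the prediction simply copies the single closest row, so one noisy row decides the outcome; with a larger k the surrounding rows can outvote it.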
A value of k = 1 is too small and may be influenced by noise.
k-NN: k = 5 worked the best at 82.38%. This seems to be the optimal value for k, and the distance measure is set appropriately. Class precision is about even for each class.
Decision tree: The decision tree algorithm didn't give me as much accuracy, and I found that turning off pre-pruning gave me a better accuracy. In the decision tree, the age binning seemed to predict middle-aged males (13 to 45) with a "low" fare well. The class recall for survived was not great at 67.85%.
Generate Test Design: I used X-Validation to perform cross-validation on the data.
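The idea behind cross-validation is that every row is used for testing exactly once and for training in all other rounds. The sketch below shows a plain, unshuffled split into folds; it is an assumption-laden illustration rather than a description of how the X-Validation operator samples, and the fold count of 10 is only an example value.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, test_indices) for each fold.

    Fold sizes differ by at most one row; rows are taken in order,
    without shuffling or stratification.
    """
    indices = list(range(n_rows))
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_rows=891, n_folds=10))
for number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {number}: train={len(train_idx)} test={len(test_idx)}")
```

The model is trained and scored once per fold, and the reported accuracy is the average over the folds; more folds means larger training sets in each round but more rounds to run.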
I initially used 20 for the number of validations, but then found that 25 achieved a better result. I used the Apply Model and Performance operators, as these are best used for classification tasks and work well with polynomial attributes. This presented me with a confusion matrix where I could measure the quality of my model by comparing the accuracy, recall and precision. Throughout my testing of operators and evaluating the confusion matrix, I found raising the class recall on true 1 (survived) the most difficult. After all my efforts I managed to raise it to 73.6%, i.e. 91 were incorrectly predicted as surviving.
Figure: Final result, workspace.
From my initial objectives I was able to determine the answers using RapidMiner. I wanted to find out if those who perished were in the majority 3rd class passengers. I found this to be true, and also that the majority who died were male 3rd class passengers. Female passengers and children fared better than most, which leads me to believe that the rule of "women and children first" applied. This may have been skewed more towards the first and second class passengers, as demonstrated in Figure 3.
Because the dataset had such a large amount of data missing concerning age, this was more difficult to determine. I found the embarked attribute interesting in the graphs I could generate from it. There seemed to be a large number of 3rd class passengers who died that had embarked at Southampton. If all the cabin numbers were present, I wonder whether it would show that Southampton 3rd class passengers had cabins close to where the iceberg hit. Did this have a bearing on their survival? Of the different algorithms I used, I found that k-NN yielded the better results.