Kaggle - Titanic: Machine Learning from Disaster

Balazs Balogh 2019

This is my learning path for Kaggle's Titanic competition. I've submitted more than 60 solutions to try as many combinations and techniques as I can. The model is constantly changing, but the solution below shows how to land in the top 10%. Questions and tips are welcome: baloghbalazs646@gmail.com / github.com/bbalazs88

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly_express as px # conda install -c plotly plotly_express - install with anaconda prompt
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

data_train = pd.read_csv('https://raw.githubusercontent.com/DatasRev/workshop-prep/master/10_titanic_ml/train.csv')
data_test = pd.read_csv('https://raw.githubusercontent.com/DatasRev/workshop-prep/master/10_titanic_ml/test.csv')

Explore the data

In [2]:
data_train.head()
Out[2]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
In [3]:
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [4]:
data_train.describe()
Out[4]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [5]:
data_test.head()
Out[5]:
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   NaN    S
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S
In [6]:
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
In [7]:
data_test.describe()
Out[7]:
PassengerId Pclass Age SibSp Parch Fare
count 418.000000 418.000000 332.000000 418.000000 418.000000 417.000000
mean 1100.500000 2.265550 30.272590 0.447368 0.392344 35.627188
std 120.810458 0.841838 14.181209 0.896760 0.981429 55.907576
min 892.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 996.250000 1.000000 21.000000 0.000000 0.000000 7.895800
50% 1100.500000 3.000000 27.000000 0.000000 0.000000 14.454200
75% 1204.750000 3.000000 39.000000 1.000000 0.000000 31.500000
max 1309.000000 3.000000 76.000000 8.000000 9.000000 512.329200

It's common practice to save the target values in a separate variable. After that, I join the train and test datasets for easier handling. Note that the Survived column is dropped before joining.

In [8]:
survived = data_train['Survived']

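# DataFrame.append was removed in pandas 2.0; pd.concat keeps each frame's own index, as append did.
data = pd.concat([data_train.drop(columns=['Survived']), data_test])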
In [9]:
data.head()
Out[9]:
   PassengerId  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
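
Because the rows are joined in order, the combined frame can later be split back into its train and test parts by position. The following is a minimal sketch on made-up toy frames, not the notebook's real data; the names toy_train, toy_test, train_part and test_part are only illustrative.

```python
import pandas as pd

# Toy stand-ins for data_train / data_test (values are invented for illustration).
toy_train = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Age': [22.0, None, 26.0],
})
toy_test = pd.DataFrame({
    'PassengerId': [4, 5],
    'Age': [34.5, None],
})

# Keep the target aside, then stack the feature columns of both frames.
target = toy_train['Survived']
combined = pd.concat([toy_train.drop(columns=['Survived']), toy_test])

# Preprocess once on the combined frame (here: median imputation of Age).
combined['Age'] = combined['Age'].fillna(combined['Age'].median())

# Split back by position: the first len(toy_train) rows are the training rows.
n_train = len(toy_train)
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]

print(train_part)
print(test_part)
```

Splitting with iloc rather than by index label matters here, because the combined frame keeps each source frame's own index and the labels therefore repeat.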

We can see that the Age, Fare, and Embarked columns have NaN values. Cabin does too, but so many values are missing there that we won't deal with it yet. Fill Age and Fare with their median values, and Embarked with its most common value, which is 'S'.

In [10]:
# Assign the result back rather than calling fillna(inplace=True) on a column selection.
data['Age'] = data['Age'].fillna(data['Age'].median())
data['Fare'] = data['Fare'].fillna(data['Fare'].median())
data['Embarked'] = data['Embarked'].fillna('S')

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1309 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1309 non-null float64
Cabin          295 non-null object
Embarked       1309 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 122.7+ KB
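
A more direct check than reading the info() output is to count the missing values per column before and after filling. The sketch below uses a small invented frame. It also takes the most common Embarked value from mode() instead of hard-coding 'S', which is an alternative to the cell above and not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the same three columns (values are invented).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Embarked': ['S', 'C', None, 'S'],
})

# Number of missing values per column before imputation.
missing_before = toy.isna().sum()

# Numeric columns: median. Categorical column: most frequent value.
toy['Age'] = toy['Age'].fillna(toy['Age'].median())
toy['Fare'] = toy['Fare'].fillna(toy['Fare'].median())
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

# Number of missing values per column after imputation.
missing_after = toy.isna().sum()

print(missing_before)
print(missing_after)
```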

EDA

In [11]:
data_train[['Pclass', 'Survived']].groupby('Pclass').mean()

# First-class passengers had the best chance of survival.
Out[11]:
        Survived
Pclass
1       0.629630
2       0.472826
3       0.242363
In [12]:
data_train[['Sex', 'Survived']].groupby('Sex').mean()

# Women had better odds.
Out[12]:
        Survived
Sex
female  0.742038
male    0.188908
In [13]:
data_train[['Embarked', 'Survived']].groupby('Embarked').mean()

# C = Cherbourg, Q = Queenstown, S = Southampton
# The port of embarkation is probably not that important for survival.
Out[13]:
          Survived
Embarked
C         0.553571
Q         0.389610
S         0.336957
In [14]:
data_train[['SibSp', 'Survived']].groupby('SibSp').mean().sort_values('Survived', ascending=False)

# Passengers with 0, 1, or 2 siblings or spouses aboard had the highest survival rates.
Out[14]:
       Survived
SibSp
1      0.535885
2      0.464286
0      0.345395
3      0.250000
4      0.166667
5      0.000000
8      0.000000
In [15]:
data_train['SibSp'].value_counts()
Out[15]:
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64
In [16]:
data_train[['Parch', 'Survived']].groupby('Parch').mean().sort_values('Survived', ascending=False)

# Passengers with 1, 2, or 3 parents or children aboard had the highest survival rates.
Out[16]:
       Survived
Parch
3      0.600000
1      0.550847
2      0.500000
0      0.343658
5      0.200000
4      0.000000
6      0.000000
In [17]:
data_train['Parch'].value_counts()
Out[17]:
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64
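
The survival rates and the group sizes above come from separate cells, yet they need to be read together: a rate computed from only a handful of passengers is much less reliable. As a sketch on invented toy data, both can be put in one table with agg.

```python
import pandas as pd

# Toy data: one large group, one small group, one single-passenger group.
toy = pd.DataFrame({
    'SibSp':    [0, 0, 0, 0, 1, 1, 5],
    'Survived': [0, 1, 0, 0, 1, 0, 0],
})

# Survival rate ('mean') and group size ('count') side by side.
summary = (
    toy.groupby('SibSp')['Survived']
       .agg(['mean', 'count'])
       .sort_values('mean', ascending=False)
)

print(summary)
```

The same pattern should work on data_train with 'SibSp' or 'Parch' as the grouping column.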
In [18]:
data_train['Cabin_stat'] = data_train['Cabin'].fillna(-1)
data_train['Cabin'].notna().sum()

print(data_train[data_train.Cabin.notna()].Survived.sum()/data_train[data_train.Cabin.notna()].Survived.count())
print(data_train[data_train.Cabin.isnull()].Survived.sum()/data_train[data_train.Cabin.isnull()].Survived.count())

# Passengers with a recorded cabin were roughly twice as likely to survive (0.67 vs. 0.30).
0.6666666666666666
0.29985443959243085
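
One way to use this observation later is to turn "has a recorded cabin" into a 0/1 column. The sketch below uses invented toy data; the column name HasCabin is hypothetical and does not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: three passengers with a cabin on record, three without.
toy = pd.DataFrame({
    'Cabin': ['C85', np.nan, 'C123', np.nan, np.nan, 'E46'],
    'Survived': [1, 0, 1, 0, 1, 0],
})

# 1 if a cabin is recorded, 0 if the value is missing.
toy['HasCabin'] = toy['Cabin'].notna().astype(int)

# Survival rate per group; the mean of a 0/1 column is the share of ones.
rate_by_cabin = toy.groupby('HasCabin')['Survived'].mean()

print(rate_by_cabin)
```

Taking the mean of Survived gives the same ratio as dividing sum() by count(), as done in the cell above.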

Visualisation

In [19]:
sns.catplot(x='Survived', col='Sex', kind='count', data=data_train)

# Women were more likely to survive.
Out[19]:
<seaborn.axisgrid.FacetGrid at 0x161b16a46d8>
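
The count plot can be backed by the numbers behind it. A minimal sketch on invented toy data, using pd.crosstab for the raw counts and normalize='index' for the share within each sex:

```python
import pandas as pd

# Toy data (invented): three women, four men.
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0],
})

# Raw counts per (Sex, Survived) pair: the numbers a count plot draws as bars.
counts = pd.crosstab(toy['Sex'], toy['Survived'])

# Row-normalised shares: each row sums to 1.
rates = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index')

print(counts)
print(rates)
```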
In [20]:
px.histogram(data_train, x="Sex", y="Survived", histfunc="count", color="Survived")

# This was made with plotly_express. In just one line we can make interactive visualisations.