December, 2024

Kaggle
Titanic Problem

Task

On April 15, 1912, the RMS Titanic sank after colliding with an iceberg, resulting in the death of 1502 out of 2224 passengers and crew.

The aim of this challenge is to build a model that predicts whether each passenger in the test dataset survived, based on passenger data (e.g. name, age, gender, socio-economic class). A training dataset is provided.

Process

Upon exploring the dataset, it was observed that the "age" feature had around 21-22% missing values, "cabin" had around 80% missing values, and "embarked" and "fare" had a few missing values.
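
As a rough illustration of how missing-value percentages like these can be measured, here is a minimal pandas sketch. The small frame is made up for the example (it only borrows the Kaggle column names), so the percentages it prints are not the ones quoted above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; the values are invented for illustration.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S", "Q"],
})

# isna() marks missing cells; the column mean of those booleans is the missing fraction.
missing_pct = (df.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

On the real data, the same expression would be applied to the frame loaded from the training CSV.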

It was decided to:

fill in the age data with the median age for that passenger's sex and class
fill in the missing embarked values by researching the passengers concerned (as there were only a few)
fill in the fare data with the median fare for that passenger's class and family size
categorise the missing cabin data as "Missing"
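
The group-wise median fill and the "Missing" cabin category can be sketched as follows. This is a minimal example on invented values, assuming the Kaggle column names Sex, Pclass, Age and Cabin; it is not the notebook's actual code.

```python
import numpy as np
import pandas as pd

# Toy frame using the Kaggle Titanic column names; the values are made up.
df = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, np.nan, 30.0, 40.0, 50.0, np.nan],
    "Cabin":  [np.nan, "E46", np.nan, "C85", np.nan, "B28"],
})

# Median age per (sex, class) group, broadcast back to every row of that group.
group_median_age = df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only the missing ages are replaced; known ages are left untouched.
df["Age"] = df["Age"].fillna(group_median_age)

# Treat a missing cabin as its own category instead of discarding the feature.
df["Cabin"] = df["Cabin"].fillna("Missing")
```

The fare fill would follow the same pattern, with class and family size as the grouping keys.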

Feature Engineering:

continuous features were binned
the family size feature was created by adding the sibling/spouse and parent/child feature values
the ticket number feature was frequency encoded
titles were extracted from the name feature and grouped into 5 categories
non-numerical features were label encoded and categorical features were one-hot encoded
the training and test datasets were scaled with scikit-learn's StandardScaler
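
Several of the steps above (family size, ticket frequency encoding, title extraction and binning) might look roughly like this in pandas. The rows are illustrative, and the title grouping and fare bin edges are assumptions made for the sketch: the write-up does not say which five title categories or which bin edges were used.

```python
import pandas as pd

# Illustrative rows in the shape of the Kaggle training data.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
        "Palsson, Miss. Torborg Danira",
        "Uruchurtu, Don. Manuel E",
    ],
    "SibSp":  [1, 1, 0, 3, 3, 0],
    "Parch":  [0, 0, 0, 1, 1, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "349909", "349909", "PC 17601"],
    "Fare":   [7.25, 71.2833, 7.925, 21.075, 21.075, 27.7208],
})

# Family size: siblings/spouses plus parents/children, as described above.
df["FamilySize"] = df["SibSp"] + df["Parch"]

# Frequency encoding: replace each ticket with the number of rows sharing it.
df["TicketFreq"] = df["Ticket"].map(df["Ticket"].value_counts())

# The title is the text between the comma and the first full stop in the name.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Assumed grouping into 5 categories: the four common titles plus "Other".
common_titles = ["Mr", "Mrs", "Miss", "Master"]
df["Title"] = df["Title"].where(df["Title"].isin(common_titles), "Other")

# Bin a continuous feature; these edges are hypothetical.
df["FareBin"] = pd.cut(df["Fare"], bins=[0, 10, 30, 600], labels=False, include_lowest=True)

# One-hot encode the grouped title.
encoded = pd.get_dummies(df, columns=["Title"])
```

Frequency encoding is ideally computed over the combined training and test tickets so that shared tickets are counted consistently, though whether the notebook does this is not stated here.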

Model:

XGBoost was chosen as gradient-boosted trees are known to perform well on tabular classification tasks like this one
hyperparameters were tuned with Bayesian optimisation using the Hyperopt library
the highest accuracy achieved was 0.79

Conclusion

I learned a lot about data preprocessing and manipulation, feature engineering and feature selection, and hyperparameter tuning. I also tried out GridSearchCV for hyperparameter tuning, but Bayesian Optimisation seemed to perform better.

There is definitely heaps more to learn and lots of room for improvement! Feel free to reach out on LinkedIn with any comments or questions about the process or the code in the Jupyter Notebook. I am extremely grateful for any feedback.