Titanic - Machine learning from Disaster using Random Forest Classifier
- Mayukha Thumiki
- Mar 9, 2023
- 7 min read
Updated: Mar 10, 2023

Preface:
As we all know, the Titanic, one of the most extravagant ships ever built, sadly sank on its maiden voyage, killing more than 1,500 people. Many variables contributed to this extremely high death toll. The data was gathered for the Kaggle challenge titled 'Titanic - Machine Learning from Disaster'. This article uses a Random Forest Classifier on these variables to predict whether a passenger survived. Additionally, I compare its effectiveness against a voting classifier, and against training on unprocessed data.
Understanding the data: The first stage is to read the Kaggle data using pandas. We import the Pandas and NumPy libraries so that we can perform operations on the data and work with it as multi-dimensional arrays. As shown in the code snippet below, the data is read using 'pd'; the data used for testing is saved in the 'test_data' variable, while the data used for training is stored in the 'train_data' variable.
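A minimal sketch of this loading step follows. On Kaggle the files live under /kaggle/input/titanic/, but since those files are not available here, a tiny inline sample with the same column layout stands in so the snippet runs anywhere; the sample rows are illustrative, not the real records.

```python
import io

import pandas as pd

# On Kaggle this would be:
#   train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
#   test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
# Here, a tiny inline CSV with the same 12 columns stands in.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C\n'
)
train_data = pd.read_csv(sample_csv)

print(train_data.head())   # first rows of the frame
print(train_data.shape)    # (2, 12) for this toy sample; (891, 12) on Kaggle
```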


Upon using the head() method, we see the first rows of the dataset. It consists of the columns ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']. The train data has 12 features and 891 records, while the test data has 11 features and 418 records. Data from all the columns except 'Survived' can be input to the model, since 'Survived' is the dependent variable; however, we shall remove the columns that are not required and process the ones with defects such as missing values.


There are different ways of moving forward with the code, e.g. by selecting only a few features and training the model as shown below.
In the first method, we identify a pattern in the data based on the gender of the passenger. For this purpose, we find the percentage of female and male passengers who survived and apply this result to the test data. In this case, the gender of the passenger becomes the main deciding factor. The code below computes the survival rate of female and male passengers.
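A sketch of this gender-rate computation, in the spirit of the referenced Kaggle starter tutorial; the small hand-made frame stands in for the real train_data, so the printed rates are for the toy sample only.

```python
import pandas as pd

# Toy stand-in for the Kaggle train_data frame (illustrative values)
train_data = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 0, 0, 1, 0],
})

# Survival rate among women: survivors / total women
women = train_data.loc[train_data.Sex == "female"]["Survived"]
rate_women = sum(women) / len(women)

# Survival rate among men: survivors / total men
men = train_data.loc[train_data.Sex == "male"]["Survived"]
rate_men = sum(men) / len(men)

print("% of women who survived:", rate_women)  # 0.5 on this toy sample
print("% of men who survived:", rate_men)
```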


Random Forest Classifier model building: For the purpose of training, we import RandomForestClassifier from sklearn.ensemble.

The model is now trained using a selected set of features, i.e. ['Pclass', 'SibSp', 'Parch', 'Sex']. The snippet below shows how the data looks after keeping only these features from the training data.

As discussed earlier, the dependent variable 'y' takes the values of the 'Survived' column from the training data and is used for fitting the model. The model is trained using RandomForestClassifier with 100 estimators and a maximum depth of 5. After the model is trained on 'X' (shown in fig 8) and 'y', predictions are made on the test data 'X_test', which has the same features as the train data. These predictions are stored in a CSV file that has the 'PassengerId' column taken from the test dataset and a 'Survived' column whose values come from our model's predictions.
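The whole first approach can be sketched end to end as follows. The toy train_data/test_data frames are stand-ins for the real Kaggle files, and random_state is an added assumption for reproducibility; the rest mirrors the steps described above (get_dummies on the four features, fit, predict, write submission.csv).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the Kaggle train_data / test_data frames
train_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "SibSp":    [1, 1, 0, 0, 1, 0],
    "Parch":    [0, 0, 0, 0, 1, 0],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})
test_data = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 1], "SibSp": [0, 1], "Parch": [0, 0],
    "Sex": ["male", "female"],
})

features = ["Pclass", "SibSp", "Parch", "Sex"]
y = train_data["Survived"]
X = pd.get_dummies(train_data[features])        # 'Sex' becomes binary columns
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

# Submission file: PassengerId from test data, Survived from our predictions
output = pd.DataFrame({"PassengerId": test_data.PassengerId,
                       "Survived": predictions})
output.to_csv("submission.csv", index=False)
```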


The submission file contains 148 records and achieves an accuracy of about 77.5% when submitted to the competition, which means that 77.5% of the records are predicted correctly.
But do the other column values play any role at all? Can we improve the accuracy by processing the features of the dataset? The answer is YES. Data preprocessing helps find and fix inconsistencies and mistakes in the data, such as missing values, incorrect data types, and outliers. It also involves identifying and eliminating redundant or unnecessary features from the dataset. Data preprocessing can significantly impact a machine learning model's effectiveness: preprocessing the data improves the model's performance, which ultimately produces better outcomes. We will now explore the data pre-processing steps that help us improve the accuracy.
My Contribution:
In the second method, we carefully evaluate and process each feature, the first step of which is to identify the missing values.

Pre-processing the train data: Let's proceed step by step. We first fill the missing values in 'Age' with the mean of the non-missing values (since it is numeric), then drop the 'Cabin' column since it has a high number of missing values (a missing rate of 0.77), and then replace the two missing values in 'Embarked' with the most frequently occurring value ('S' appears 644 times out of 891). We then drop 'PassengerId', 'Name', 'Ticket' and 'Fare' from the train dataset, since these values do not contribute to the survival of the passenger. Just imagine: if you knew a passenger's name, could you ever determine whether or not they escaped the Titanic disaster? Not really. Now we are left with seven features, i.e. ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked']. From this list, we separate out the 'Survived' column as 'y_tr', the dependent variable for training the model. The data looks as follows once we are left with six features.
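The preprocessing steps above can be sketched as below. The three-row frame is a toy stand-in for the real 891-row train_data (so the fill values differ from the real mean and from 'S'-frequency counts), but the operations are the ones described: fillna on 'Age', drop 'Cabin', fillna on 'Embarked' with the mode, drop the four unhelpful columns, and split off the target.

```python
import pandas as pd

# Toy stand-in for train_data; the real frame comes from pd.read_csv
train_data = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["t1", "t2", "t3"],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "S", None],
})

# 1) Fill missing 'Age' with the mean of the non-missing ages
train_data["Age"] = train_data["Age"].fillna(train_data["Age"].mean())
# 2) Drop 'Cabin' (too many missing values)
train_data = train_data.drop("Cabin", axis=1)
# 3) Fill missing 'Embarked' with the most frequent value ('S' in the real data)
train_data["Embarked"] = train_data["Embarked"].fillna(train_data["Embarked"].mode()[0])
# 4) Drop columns that do not help predict survival
train_data = train_data.drop(["PassengerId", "Name", "Ticket", "Fare"], axis=1)
# 5) Separate the dependent variable
y_tr = train_data["Survived"]
train_data = train_data.drop("Survived", axis=1)

print(list(train_data.columns))  # the six remaining features
```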




Pre-processing the test data: The process followed for the train data needs to be performed on the test data as well; however, our work is half done, since we now know that only six features shall be used for testing. So we find the missing values in the test data and fill them in.

Since 'Age' has 86 missing values, we fill it with the mean of the other values, as done on the train data. After performing this pre-processing on the selected six columns of the test data, the first five records of the dataset look as shown in fig 17 below.
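The same idea on the test side might look like this. Again the frame is a small illustrative stand-in (the real test data has 418 rows and 86 missing ages), showing column selection plus the mean-fill on 'Age'.

```python
import pandas as pd

# Toy stand-in for the Kaggle test_data frame
test_data = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 1, 2],
    "Name": ["X", "Y", "Z"],
    "Sex": ["male", "female", "male"],
    "Age": [34.5, None, 62.0],
    "SibSp": [0, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["t1", "t2", "t3"],
    "Fare": [7.83, 7.00, 9.69],
    "Cabin": [None, None, "B45"],
    "Embarked": ["Q", "S", "Q"],
})

# Keep only the six features used for training
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked"]
test_data = test_data[features].copy()
# Fill missing 'Age' with the mean of the non-missing ages
test_data["Age"] = test_data["Age"].fillna(test_data["Age"].mean())

print(test_data.head())
```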

Before moving on to model training, we store the train and test data in the 'X_tr' and 'X_ts' variables and use pandas' get_dummies() function to create dummy variables from the categorical data. Dummy variables are binary variables that represent categorical data in a form that machine learning models can easily use.
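A quick sketch of what get_dummies() does to the two categorical columns here ('Sex' and 'Embarked'); the two-row frame is a toy stand-in for the preprocessed train data.

```python
import pandas as pd

# Toy preprocessed frame standing in for the real X_tr
X_tr = pd.DataFrame({
    "Pclass": [3, 1], "Sex": ["male", "female"], "Age": [24.0, 38.0],
    "SibSp": [1, 0], "Parch": [0, 0], "Embarked": ["S", "C"],
})

# Each categorical column is replaced by one binary column per category
X_tr = pd.get_dummies(X_tr)
print(list(X_tr.columns))
```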


Building the model - comparison between Random Forest and Voting Classifier
Random Forest Classifier: Similar to the training in the first approach (without processed data), we train the model using RandomForestClassifier with 900 estimators and a maximum depth of 7. The number of estimators is one of the most important hyperparameters of the Random Forest Classifier. Up to a certain point, increasing the number of estimators increases the model's accuracy; after that, the gain becomes minimal. It is necessary to strike a balance between model performance and computational efficiency, because adding more estimators also raises the computational cost. Another crucial hyperparameter is the depth of each tree. Too little depth can result in underfitting, where the model is oversimplified; on the other hand, too much depth makes the model complex and can lead to overfitting. To select the ideal depth, it is advisable to experiment with various depths and assess the model's performance on a test set. I tested different values for n_estimators and max_depth and finally settled on these values because of the results: the submission file has 129 records and achieved approx. 79.2% accuracy when submitted to the Kaggle competition.
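A sketch of the second-approach training with these hyperparameters; synthetic data from make_classification stands in for the preprocessed X_tr / y_tr, and random_state is an added assumption for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the real X_tr / y_tr come from the preprocessing above
X_tr, y_tr = make_classification(n_samples=300, n_features=8, random_state=0)

# The hyperparameters settled on after experimentation in the article
model = RandomForestClassifier(n_estimators=900, max_depth=7, random_state=1)
model.fit(X_tr, y_tr)

print("training accuracy:", model.score(X_tr, y_tr))
```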

Voting Classifier: A Voting Classifier is an ensemble learning method that combines various models to enhance prediction performance. Multiple models are trained on the same dataset, and their predictions are then combined to produce the final estimate. The theory behind this method is that combining the predictions of several models can increase the overall accuracy. There are two types of voting: 'hard' and 'soft'. Here, I have used 'hard' voting, which means the final prediction is the class that receives the majority of the individual models' votes. I have combined Logistic Regression, Random Forest, K-Neighbors, Gaussian Naive Bayes and Support Vector classifiers. For this purpose, we import the required libraries from sklearn as shown.

The Voting Classifier requires the models to be passed via its 'estimators' parameter, so we define the list of estimators before fitting the model. After training, the results were saved to a CSV file; the file had 139 records and achieved an accuracy of 77.9% when submitted to the Kaggle competition.
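The ensemble described above can be sketched as follows. Synthetic data again stands in for the preprocessed frames, and the estimator names ("lr", "rf", ...) plus max_iter/random_state settings are illustrative assumptions; the five model types and 'hard' voting match the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed X_tr / y_tr
X_tr, y_tr = make_classification(n_samples=200, n_features=8, random_state=0)

# The five base models combined in the article
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=900, max_depth=7, random_state=1)),
    ("knn", KNeighborsClassifier()),
    ("gnb", GaussianNB()),
    ("svc", SVC()),
]

# 'hard' voting: the predicted class is the one most models vote for
voting = VotingClassifier(estimators=estimators, voting="hard")
voting.fit(X_tr, y_tr)
preds = voting.predict(X_tr)
```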


Result and Analysis:
| | Random Forest Classifier without data pre-processing | Random Forest Classifier after data pre-processing | Voting Classifier after data pre-processing |
|---|---|---|---|
| No. of records in submission.csv file | 148 | 129 | 139 |
| Accuracy (%) | 77.51 | 79.17 | 77.99 |
| Features used | 'Pclass', 'SibSp', 'Parch', 'Sex' | 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked' | 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked' |
| Parameter values for RFC | n_estimators = 100, max_depth = 5 | n_estimators = 900, max_depth = 7 | n_estimators = 900, max_depth = 7 |
From the above results, we can conclude that data pre-processing made a significant impact (an improvement of about 2%) on the model's performance. However, the voting classifier did not perform better than the Random Forest Classifier, even though it did perform better than Alexis B Cook's code. A possible explanation is that the ensemble may have overfit, causing the overall test error to increase and performance to decrease. Thus, the code provided by Mr. Cook gives a model performance of 77.5%, and I improved the performance to 79.2% by processing the data and trying different model parameters. The following points summarize my contribution.
Filling in missing values of 'Age' and 'Embarked'.
Dropping unwanted columns such as 'PassengerId', 'Name', 'Ticket' and 'Fare'.
Dropping the column with a high rate of missing values ('Cabin').
Increasing the number of estimators and the tree depth to reach optimal performance.
Trying out other models and identifying why their performance falls short.
Challenges and outcomes:
I did a tremendous amount of study to determine the features that would affect the passengers' survival. In some instances, such as 'Cabin', it initially appeared that the feature was significant and that the empty values needed to be filled in with random values. However, after careful consideration, I decided that it could be omitted.
Hyperparameter tuning: It took considerable effort and experimentation to select the model's ideal hyperparameters to get the best outcomes.
There were limited attempts (10 per day) to test the model's performance in the competition. Therefore, before uploading the file and receiving the actual results, I had to roughly estimate the accuracy, improve further, and then estimate again.
The idea of a voting classifier was new to me. Prior to using it in the project, I had to learn it on my own.
Link to Kaggle Notebook:
https://www.kaggle.com/thumikimayukha/titanic
References:
1. https://www.kaggle.com/code/alexisbcook/titanic-tutorial/notebook
2. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
3. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html
4. https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.drop.html
5. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html