This notebook covers the machine learning process used to analyse the plane crash survivors data provided in Classification_train.csv and Classification_test.csv.
The method used for predictions is Logistic Regression, which achieves an accuracy of 95%.
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
Importing the warnings module to ignore FutureWarning and DeprecationWarning.
These warnings flag features that might be deprecated in future versions. The features work fine on the latest versions as of today, 3rd Nov 2018.
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
A CSV file can be loaded as a DataFrame using pandas.read_csv
After loading, printing info and head to see what we're working with
dataset = pd.read_csv("Classification_train.csv")
dataset.head()
dataset.info()
Visualising the data is essential to see which features are more important and which can be dropped.
sns.barplot(x="Embarked", y="Survived", hue="Sex", data=dataset);
As the barplot above shows, females survived the crash at a much higher rate than males, regardless of where they embarked.
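To quantify this, a quick optional check groups survival rate by sex, mirroring the groupby pattern used later in this notebook:
# Mean of Survived per Sex gives the survival rate for each group
dataset[['Sex', 'Survived']].groupby(by=['Sex'], as_index=False).mean()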
sns.heatmap(dataset.corr(), annot=True)
The correlation heatmap shows that Survived correlates most strongly with Fare, suggesting that passengers who paid higher fares (and thus travelled in higher classes) had better odds in case of a mishap.
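We can read these correlations off directly as well; a small optional check that backs up the heatmap:
# Correlation of each numeric feature with Survived, strongest first
dataset.corr()['Survived'].sort_values(ascending=False)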
dataset.describe()
We can see that the above description did not account for the columns Name, Sex, Ticket, Cabin, and Embarked as they are non-numeric
The following code gives, for each non-numeric string/categorical column, the count of entries, the number of unique values, and the most frequent value with its frequency.
dataset.describe(include=['O'])
dataset.head()
There seem to be some NaN values in the column Cabin.
A NaN indicates that no data was available for that particular value.
Getting NaN values is common when dealing with real-world data, and other columns might have missing data as well. We should check for these gaps before applying any machine learning algorithms to the dataset.
Thankfully, the DataFrame class provides an isnull() method which checks for NaN values and returns True or False for each entry.
We can count the number of NaN values using the sum() method.
dataset.isnull().sum()
dataset[['Pclass', 'Survived']].groupby(by=['Pclass'], as_index=False).mean()  # Pclass 1 passengers survived at a noticeably higher rate
By analysing the dataset we see that the features PassengerId, Name, Ticket, and Cabin play an insignificant role in survivability, and their NaN/null values might interfere with the accuracy.
We can drop these unimportant columns from the dataset, then print head() to see what we're left with.
dataset = dataset.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])
dataset.head()
NaN values
The Embarked column is a non-numeric categorical set with only one missing element.
The following code fills the NaN value with the most frequent value
# Getting the most frequent element using pandas get_dummies()
most_occ = pd.get_dummies(dataset['Embarked']).sum().sort_values(ascending=False).index[0]
# The snippet above counts each Embarked category, sorts the counts in descending order, and takes the label of the most frequent one
def replace_nan(x):
    # Return the most frequent element if x is null, else return x unchanged
    if pd.isnull(x):
        return most_occ
    else:
        return x
# Applying replace_nan() to every value in the Embarked column
dataset['Embarked'] = dataset['Embarked'].map(replace_nan)
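A more concise alternative (an equivalent sketch using pandas built-ins; the result is the same) is fillna with the column's mode:
# Equivalent one-liner, shown for reference only:
# dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])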
X will contain all the features.
y will contain the observed outcomes, i.e. the Survived column.
So far, we've been dealing with the training set
# Select all rows and all columns except column 0
X = dataset.iloc[:, 1:8].values
# Select all rows from column 0 (Survived)
y = dataset.iloc[:, 0].values
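As a quick optional sanity check, we can confirm the shapes line up:
# X should have one row per passenger and 7 feature columns; y one label per row
print(X.shape, y.shape)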
Since we've dropped unimportant features from our training data, the testing data must be put in the same format to predict results accurately. We use the same cleaning process as with the train dataset.
# Load CSV into DataFrame
X_test=pd.read_csv("Classification_test.csv")
y_test=pd.read_csv("Classification_ytest.csv")
X_test.head()
X_test is in the same format as our dataset, excluding the Survived column.
The columns that need to be dropped are PassengerId, Name, Ticket, and Cabin:
X_test = X_test.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin']).values
y_test.head()
y_test only needs to drop the PassengerId column
y_test = y_test.drop(columns='PassengerId').values
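Another quick optional check that the test features and labels have matching row counts:
# Both arrays should report the same number of rows
print(X_test.shape, y_test.shape)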
Now that the train and test data are in the same format, we can proceed to manipulating the data.
NaN values
The Age column has many NaN values, which we will fill with the median age from the dataset.
The Fare column has some NaN values in the test dataset, which we plan on filling with the mean fare.
# The Age column has 177 missing values in training: dataset['Age'].isnull().sum()
# The test dataset has missing Age values as well
from sklearn.preprocessing import Imputer
# Replace NaN values using the median as the imputation strategy
imputer = Imputer(missing_values = 'NaN', strategy = 'median', axis = 0)
# Imputer only accepts 2D matrices
# Slicing [:, n:n+1] passes only the nth column while keeping it 2D
# Here column 2 is Age
imputer = imputer.fit(X[:,2:3])
X[:,2:3] = imputer.transform(X[:,2:3])
# Reuse the imputer fitted on the training data for the test Age column
X_test[:,2:3] = imputer.transform(X_test[:,2:3])
# Using the mean as the imputation strategy for Fare
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
# The 5th column is the Fare
imputer = imputer.fit(X_test[:,5:6])
X_test[:,5:6] = imputer.transform(X_test[:,5:6])
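As a quick optional check, we can confirm that no NaN values remain in either feature matrix:
# pd.isnull works elementwise on the arrays; the sums should both be 0
print(pd.isnull(X).sum(), pd.isnull(X_test).sum())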
After the above snippet has executed, imputer will have replaced all the NaN values according to the specified strategy.
Now we can move on to encoding and fitting the dataset into an Algorithm
LabelEncoder is used to convert non-numerical string/categorical values into numerical values which can be processed using various sklearn classes
It encodes values between 0 and n-1; where n is the number of categories
The features which need encoding are Sex and Embarked:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
# Column 6 is Embarked
X[:, 6] = labelencoder_X.fit_transform(X[:, 6])
# Reuse the fit from the training column so the test set gets the same mapping
X_test[:, 6] = labelencoder_X.transform(X_test[:, 6])
# Column 1 is Sex
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])
X_test[:, 1] = labelencoder_X.transform(X_test[:, 1])
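To inspect the mapping the encoder learned (a small optional check; classes_ reflects the most recent fit, here the Sex column):
# classes_ lists the original categories in encoded order: index 0 -> 0, index 1 -> 1, ...
print(labelencoder_X.classes_)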
Often, when we use LabelEncoder on more than 2 categories, the machine learning algorithm might try to find a relation between the values, such as an increasing or decreasing pattern. This results in lower accuracy.
To avoid this we can further encode the labels using OneHotEncoder. It takes a column of label-encoded categorical data and splits it into multiple columns: the numbers are replaced by 1s and 0s, depending on which column holds which value. Hence the name OneHotEncoder.
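As a tiny illustration with hypothetical values (not from our data), three label-encoded categories expand into three binary columns:
# Mini-demo: each of the values 0, 1, 2 gets its own indicator column
demo = OneHotEncoder()
print(demo.fit_transform([[0], [1], [2]]).toarray())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]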
onehotencoder = OneHotEncoder(categorical_features=[0, 1, 6])
# 0 : Pclass column
# 1 : Sex
# 6 : Embarked
# OneHotEncoder takes an array as input
X = onehotencoder.fit_transform(X).toarray()
# Reuse the encoder fitted on the training data so the test columns line up
X_test = onehotencoder.transform(X_test).toarray()
With One Hot Encoding complete, we can proceed to fit the data into our LogisticRegressor
LogisticRegression
LogisticRegression is used only when the dependent variable/prediction is binary, i.e. it takes only two values. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
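For reference, the standard logistic model behind this estimator passes a linear combination of the features through the sigmoid function to get a probability:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}$$
Hard 0/1 predictions are then obtained by thresholding this probability at 0.5.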
from sklearn.linear_model import LogisticRegression
# Initializing the classifier
lr = LogisticRegression()
# Fitting the classifier on the training data
lr.fit(X, y)
# Getting predictions by feeding features from the test data
y_pred = lr.predict(X_test)
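If class probabilities are needed instead of hard 0/1 predictions (an optional aside, not used further below), LogisticRegression also provides predict_proba:
# Each row gives [P(not survived), P(survived)] for one test passenger
print(lr.predict_proba(X_test)[:5])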
Creating a scatter plot of actual versus predicted values
plt.scatter(y_test, y_pred, marker='x')
A ConfusionMatrix is used to compare the predicted values against the actual output.
For a binary classifier it is a 2x2 matrix of the form:
[[TN  FP]
 [FN  TP]]
where TN and TP count correctly predicted negatives and positives, and FP and FN count the misclassifications.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)
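The accuracy can also be read directly off the ConfusionMatrix: the diagonal entries are the correctly classified samples, so accuracy = (TN + TP) / total. A quick check:
# Diagonal entries are correct predictions; cm.sum() is the total sample count
print((cm[0, 0] + cm[1, 1]) / cm.sum())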
To get the accuracy, we use classification_report, which measures the accuracy of the algorithm based on the ConfusionMatrix.
An ideal classifier with 100% accuracy would produce a purely diagonal ConfusionMatrix, with every point predicted in its correct class.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
After analysing the given dataset and applying LogisticRegression to its features, we see that the algorithm can accurately predict the survivability of a passenger 95% of the time.