A lack of financial inclusion remains one of the main obstacles to economic and human development in Africa. For example, across Kenya, Rwanda, Tanzania, and Uganda, only 9.1 million adults (14% of the adult population) have access to or use a commercial bank account.
Traditionally, access to bank accounts has been regarded as an indicator of financial inclusion. Despite the proliferation of mobile money in Africa, and the growth of innovative fintech solutions, banks still play a pivotal role in facilitating access to financial services. Access to bank accounts enables households to save and make payments while also helping businesses build up their creditworthiness and improve their access to loans, insurance, and related services. Therefore, access to bank accounts is an essential contributor to long-term economic growth.
The objective of this project is to create a machine-learning model to predict which individuals are most likely to have or use a bank account. The models and solutions developed can provide an indication of the state of financial inclusion in Kenya, Rwanda, Tanzania and Uganda while providing insights into some of the key factors driving individuals’ financial security.
Importing Libraries
1. Load the dataset
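In the original notebook the data comes from the competition files (e.g. `train = pd.read_csv("Train.csv")`). The sketch below uses a small toy frame with a few of the dataset's columns standing in for the real file, just to illustrate the loading and inspection step:

```python
import pandas as pd

# Toy frame standing in for pd.read_csv("Train.csv"); real columns include
# uniqueid, country, bank_account, household_size, among others.
train = pd.DataFrame({
    "uniqueid": ["uniqueid_1", "uniqueid_2", "uniqueid_3"],
    "country": ["Kenya", "Rwanda", "Tanzania"],
    "bank_account": ["Yes", "No", "No"],
    "household_size": [3, 5, 2],
})

print(train.shape)   # (number of rows, number of columns)
print(train.head())  # first five rows (only three exist in this toy frame)
```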
The above output shows the number of rows and columns in the train and test datasets. The train dataset has 13 variables: 12 independent variables and 1 dependent variable. The test dataset has the same 12 independent variables. We can observe the first five rows of our dataset by using the head() method from the pandas library.
We don’t have missing data in our dataset.
It is important to understand the meaning of each feature so we can really understand the dataset. We can read the VariableDefinition.csv file to understand the meaning of each variable presented in the dataset.
The SampleSubmission.csv gives us an example of how our submission file should look. This file will contain the uniqueid column combined with the country name from the Test.csv file and the target we predict with our model. Once we have created this file, we will submit it to the competition page and obtain a position on the leaderboard.
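A sketch of building that file is shown below. The `" x "` joiner between the id and the country name is an assumption about the expected format, and the toy test identifiers and predictions are placeholders:

```python
import pandas as pd

# Toy identifiers standing in for Test.csv.
test = pd.DataFrame({"uniqueid": ["uniqueid_1", "uniqueid_2"],
                     "country": ["Kenya", "Rwanda"]})
predictions = [1, 0]  # the model's predictions would go here

submission = pd.DataFrame({
    # Assumed format: id and country joined, e.g. "uniqueid_1 x Kenya".
    "uniqueid": test["uniqueid"] + " x " + test["country"],
    "bank_account": predictions,
})
submission.to_csv("first_submission.csv", index=False)
```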
2. Understand the dataset
We can get more information about the features presented by using the info() method from pandas.
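A minimal illustration of the call, again on a toy frame standing in for the real train data:

```python
import pandas as pd

train = pd.DataFrame({
    "country": ["Kenya", "Rwanda"],        # object dtype
    "household_size": [3, 5],              # integer dtype
    "bank_account": ["Yes", "No"],         # object dtype
})
train.info()  # prints each column's name, non-null count, and dtype
```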
The output lists each variable/feature, its size, whether it contains missing values, and its data type. The dataset has no missing values; there are 3 features with an integer data type and 10 with the object data type.
3. Data preparation for machine learning
Before we train a model for prediction, we need to perform data cleaning and preprocessing. This is a very important step; our model will not perform well without it.
The first step is to separate the independent variables and the target (bank_account) in the train data, then transform the target values from the object data type to numerical values using LabelEncoder.
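A minimal sketch of those two steps, with a toy frame standing in for the train data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame({"bank_account": ["Yes", "No", "No", "Yes"],
                      "household_size": [3, 5, 2, 4]})

# Separate the independent variables from the target.
X_train = train.drop(columns="bank_account")
y_train = train["bank_account"]

# Encode the target: LabelEncoder sorts classes, so 'No' -> 0 and 'Yes' -> 1.
le = LabelEncoder()
y_train = le.fit_transform(y_train)
print(y_train)  # [1 0 0 1]
```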
The target values have been transformed into numerical datatypes, 1 represents ‘Yes’ and 0 represents ‘No’. We have created a simple preprocessing function to:
- Handle conversion of data types.
- Convert categorical features to numerical features using one-hot encoding and label encoding.
- Drop the uniqueid variable.
- Perform feature scaling.
The preprocessing function will be applied to the independent variables of both the train and test datasets.
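A sketch of such a function is shown below. The function name, the chosen columns, and the use of MinMaxScaler are assumptions for illustration; the original notebook covers all twelve features:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocessing_data(data):
    """Sketch of the preprocessing steps listed above (column names assumed)."""
    data = data.copy()
    # Data-type conversion: numeric values stored as strings.
    data["household_size"] = data["household_size"].astype(int)
    # One-hot encode a multi-category feature.
    data = pd.get_dummies(data, columns=["country"])
    # Label-encode a binary feature.
    data["cellphone_access"] = (data["cellphone_access"] == "Yes").astype(int)
    # Drop the identifier, which carries no predictive signal.
    data = data.drop(columns="uniqueid")
    # Scale all features to the [0, 1] range.
    scaler = MinMaxScaler()
    return pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

toy = pd.DataFrame({
    "uniqueid": ["uniqueid_1", "uniqueid_2"],
    "country": ["Kenya", "Rwanda"],
    "cellphone_access": ["Yes", "No"],
    "household_size": ["3", "5"],
})
processed = preprocessing_data(toy)
print(processed.shape)  # more columns than before, due to one-hot encoding
```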
Preprocess both train and test dataset.
Observe the first row in the train data.
Observe the shape of the train data.
Now we have more independent variables than before (37 in total). This doesn't mean all of them are important for training our model; ideally we would select only the features that improve performance, but we will not apply any feature-selection technique in this project.
4. Model Building and Experiments
A portion of the train dataset will be used to evaluate our models and find the one that performs best before applying it to the test dataset.
Only 10% of the train dataset will be used for evaluating the models. The parameter stratify=y_train ensures that the train and validation sets contain the same proportion of values from both classes ('Yes' and 'No'). There are many classification models we could choose from.
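The split can be sketched as follows, with a toy imbalanced target standing in for the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and an imbalanced 0/1 target, like the real dataset.
rng = np.random.default_rng(42)
X = rng.random((100, 5))
y = np.array([0] * 80 + [1] * 20)

# Hold out 10% for validation; stratify preserves the class ratio in both splits.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

print(y_tr.mean(), y_val.mean())  # both exactly 0.20 here
```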
We will be using XGBoost. We will start by training the model on the train set obtained after splitting our train dataset.
The evaluation metric for this challenge is the percentage of survey respondents for whom the binary 'bank account' classification is predicted incorrectly. This means that the lower the percentage, the better the model performance.
Let’s check the confusion matrix for the XGB model.
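The computation looks like this; the toy labels below are chosen to reproduce the kind of pattern described next, where class 0 is mostly right and class 1 is often missed:

```python
from sklearn.metrics import confusion_matrix

# Toy validation labels and predictions illustrating an imbalance-driven pattern.
y_val  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

cm = confusion_matrix(y_val, y_pred)
print(cm)
# Rows are true classes, columns are predicted classes:
# [[6 0]
#  [2 2]]
```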
Our XGBoost model performs well at predicting class 0 but poorly at predicting class 1. This may be caused by the imbalance in the data: the target variable has many more ‘No’ values than ‘Yes’ values. You can learn the best way to deal with imbalanced data here.
One way to improve model performance is to apply grid search, a parameter-tuning approach that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid. You can skip this cell if you would like.
The above source code evaluates which values of min_child_weight, gamma, subsample, and max_depth give us better performance.
Let’s use these parameter values and see if the XGB model performance will increase.
Our XGB model has improved, with the error dropping from 0.118 to 0.113. Ideally we want to push this even lower, e.g. from 0.110 to 0.108. Keep improving the XGBoost model and see how it performs on the test dataset.