The Survey of Consumer Finances (SCF) is normally a triennial cross-sectional survey of U.S. families. The survey data include information on families’ balance sheets, pensions, income, and demographic characteristics. Information is also included from related surveys of pension providers and the earlier such surveys conducted by the Federal Reserve Board. No other study for the country collects comparable information. Data from the SCF are widely used, from analysis at the Federal Reserve and other branches of government to scholarly work at the major economic research centres.
In this project, we are going to work with data from the SCF, a survey sponsored by the US Federal Reserve. It tracks financial, demographic and opinion information about families in the United States. The survey is conducted every three years and we will work with an extract of the results from 2019.
EXPLORING THE DATA
Prepare Data
Import Data
First, we need to load the data which is stored in a compressed csv file: SCFP2019.csv.
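A minimal sketch of the loading step, assuming pandas. So that it runs anywhere, it first writes a tiny made-up gzip-compressed CSV to a temporary folder; the file name scf_sample.csv.gz and its three columns are placeholders, not the real extract.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the real extract: a tiny, made-up gzip-compressed CSV.
# Replace `path` with the location of your own SCF file.
tmp_dir = Path(tempfile.mkdtemp())
path = tmp_dir / "scf_sample.csv.gz"
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("TURNFEAR,AGECL,AGE\n1,1,29\n0,4,58\n1,2,41\n")

# pandas infers gzip compression from the ".gz" suffix by default.
df = pd.read_csv(path)
print("df shape:", df.shape)
print(df.head())
```

With the real file, the same read_csv call applies; only the path changes.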
Clean
The first thing you might notice here is that this dataset is huge – over 20,000 rows and 351 columns! We will not explore all of the features in this dataset, but we can look in the data dictionary for this project for details and links to the official CodeBook. For now, let’s just say that this dataset tracks all sorts of behaviours relating to the ways households earn, save and spend money in the United States.
For this project, we are going to focus on households that have been turned down for credit or feared being denied credit in the past 5 years. These households are identified in the “TURNFEAR” column.
Subset Data
We will subset df to only the households that have been turned down or feared being turned down for credit (“TURNFEAR == 1”).
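One way to write this subset is with a boolean mask. The sketch below uses a small made-up frame in place of the real data; the name df_fear matches the one used later in this write-up.

```python
import pandas as pd

# Made-up rows standing in for the full survey extract.
df = pd.DataFrame(
    {"TURNFEAR": [1, 0, 1, 0, 0], "AGE": [29, 58, 41, 66, 35]}
)

# Keep only households flagged as credit fearful.
mask = df["TURNFEAR"] == 1
df_fear = df[mask]
print("df_fear shape:", df_fear.shape)
```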
Explore
AGE
Now that we have our subset, let’s explore the characteristics of this group. One of the features is age group (AGECL).
AGE GROUPS
Looking at the CodeBook we can see that “AGECL” represents categorical data, even though the values in the column are numeric.
AGECL Age group of the reference person
1. < 35
2. 35 – 44
3. 45 – 54
4. 55 – 64
5. 65 – 74
6. >= 75
So before we create a visualization, let’s create a version of this column that uses the actual group names.
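A sketch of that relabelling, assuming the numeric codes 1 to 6 follow the order of the list above. The label strings and the name agecl_dict are our own choices, and the data is made up.

```python
import pandas as pd

# Made-up AGECL codes for a handful of credit-fearful households.
df_fear = pd.DataFrame({"AGECL": [1, 1, 2, 3, 6]})

# Assumption: codes 1-6 map to the age groups in codebook order.
agecl_dict = {
    1: "Under 35",
    2: "35-44",
    3: "45-54",
    4: "55-64",
    5: "65-74",
    6: "75 or Older",
}

# Map each numeric code to its group name, then count households per group.
age_cl = df_fear["AGECL"].map(agecl_dict)
age_cl_counts = age_cl.value_counts()
print(age_cl_counts)
```

The counts in age_cl_counts are what a bar chart of the age groups would display.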
AGE GROUP BAR CHART
Now that we have better labels, let’s make a bar chart and see the age distribution of our group.
Notice that by creating their own age groups, the survey’s authors have basically made a histogram for us, composed of 6 bins. Our chart is telling us that many of the people who fear being denied credit are younger. But the first age group (under 35) covers a wider range than the 10-year groups that follow it. So it might be useful to look inside those values to get a more granular understanding of the data. To do that, we will need to look at a different variable: AGE. Whereas AGECL was a categorical variable, AGE is continuous, so we can use it to make a histogram of our own.
AGE HISTOGRAM
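As a rough illustration of what the histogram computes, the sketch below bins made-up ages into 10-year bins with NumPy. The bin edges are an arbitrary choice for this example, and the plotting call itself is left out.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for the AGE column of the credit-fearful subset.
df_fear = pd.DataFrame({"AGE": [22, 25, 31, 33, 38, 39, 44, 52, 61, 78]})

# 10-year bins from 20 to 80; the final bin includes its right edge.
bin_edges = [20, 30, 40, 50, 60, 70, 80]
counts, edges = np.histogram(df_fear["AGE"], bins=bin_edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {n}")
```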
It looks like younger people are still more concerned about being able to secure a loan than older people, but the people who are most concerned seem to be between 30 and 40 years old.
RACE
Now that we have an understanding of how age relates to our outcome of interest, let’s try some other possibilities, starting with race. If we look at the CodeBook for “RACE”, we can see that there are 4 categories.
1. WHITE (INCLUDE MIDDLE EASTERN/ARAB WITH WHITE)
2. BLACK/AFRICAN-AMERICAN
3. HISPANIC/LATINO
5. OTHER
Note that there is no 4th category here. If a value of 4 did exist, it would be reasonable to assign it to “Asian American / Pacific Islander” – a group that doesn’t seem to be represented in the dataset. This is a strange omission, but you’ll often find that large public datasets have these sorts of issues. The important thing is to always read the data dictionary carefully. In this case, remember that this dataset doesn’t provide a complete picture of race in America – something that we’d have to explain to anyone interested in our analysis.
RACE BAR CHART: CREDIT FEARFUL
This suggests that White/Non-Hispanic people worry more about being denied credit. But thinking critically about what we are seeing, that might be because there are more White/Non-Hispanic people in the population of the United States than there are people in other racial groups, and the sample for this survey was specifically drawn to be representative of the population as a whole.
RACE BAR CHART: WHOLE DATASET
How does this second bar chart change our perception of the first one? On the one hand, we can see that White/Non-Hispanic respondents account for around 70% of the whole dataset, but only 54% of credit-fearful respondents. On the other hand, Black and Hispanic respondents represent 23% of the whole dataset but 40% of credit-fearful respondents. In other words, Black and Hispanic households are actually more likely to be in the credit-fearful group.
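The comparison behind these two charts comes down to normalized value counts. The sketch below uses made-up rows, so its proportions are not the survey's; it only shows how the two sets of shares can be placed side by side.

```python
import pandas as pd

# Made-up data: 10 households with a RACE code and a TURNFEAR flag.
df = pd.DataFrame(
    {
        "RACE": [1, 1, 1, 1, 1, 1, 1, 2, 2, 3],
        "TURNFEAR": [1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    }
)
df_fear = df[df["TURNFEAR"] == 1]

# Share of each group in the whole dataset and in the credit-fearful subset.
whole_share = df["RACE"].value_counts(normalize=True)
fear_share = df_fear["RACE"].value_counts(normalize=True)

compare = pd.DataFrame({"whole": whole_share, "fearful": fear_share})
print(compare)
```

A group whose "fearful" share exceeds its "whole" share is over-represented among credit-fearful households.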
DATA ETHICS: It’s important to note that segmenting customers by race or any other demographic group for the purpose of lending is illegal in the United States. The same thing might be legal elsewhere, but even if it is, making decisions for things like lending based on racial categories is clearly unethical. This is a great example of how easy it can be to use data science tools to support and propagate systems of inequality.
INCOME
What about income level? Are people with lower incomes concerned about being denied credit or is that something people with more money worry about? To answer that question, we’ll need to compare the entire dataset with our subgroup using the “INCCAT” feature, which captures income percentile groups. This time, though, we will make a single, side-by-side bar chart.
INCCAT Income percentile groups
1. 0 – 20
2. 21 – 39.9
3. 40 – 59.9
4. 60 – 79.9
5. 80 – 89.9
6. 90 – 100
INCOME CATEGORIES: CREDIT FEARFUL VS. CREDIT FEARLESS
INCOME CATEGORIES: SIDE BY SIDE BAR CHART
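One possible way to build the table behind a side-by-side chart is a row-normalized cross-tabulation, sketched here on made-up data. Using pd.crosstab is our choice for the illustration; the original notebook may have grouped the data differently.

```python
import pandas as pd

# Made-up data: income category code per household, with the TURNFEAR flag.
df = pd.DataFrame(
    {
        "TURNFEAR": [0, 0, 0, 0, 1, 1, 1, 1],
        "INCCAT": [3, 4, 5, 6, 1, 1, 2, 3],
    }
)

# Each row (TURNFEAR = 0 or 1) sums to 1: the share of that group
# falling into each income category.
shares = pd.crosstab(df["TURNFEAR"], df["INCCAT"], normalize="index")
print(shares)
```

Each row of shares gives one set of bars, so the two groups can be drawn next to each other for every income category.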
Comparing the income categories across the fearful and non-fearful groups, we can see that credit-fearful households are much more common in the lower-income categories. In other words, the credit fearful have lower incomes. So based on all this, what do we know? Among the people who responded that they were indeed worried about being approved for credit after having been denied in the past 5 years, the young and the low-income groups had the highest number of respondents. That makes sense, right? Young people tend to make less money and rely more heavily on credit to get their lives off the ground, so having been denied credit makes them more anxious about the future.
ASSETS
Not all the data is demographic, though. If you were working for a bank, you would probably care less about how old the people are, and more about their ability to carry more debt. If we were going to build a model for that, we’d want to establish some relationships among the variables, and making some correlation matrices is a good place to start. First, let’s zoom out a little bit. We’ve been looking at only the people who answered “yes” when the survey asked about “TURNFEAR”, but what if we looked at everyone instead? To begin with, let’s bring in a clear dataset and run a single correlation.
ASSETS VS HOME VALUE: WHOLE DATASET
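A single correlation can be computed with Series.corr, which returns the Pearson coefficient by default. The column name ASSET is an assumption about the extract, and the values below are made up.

```python
import pandas as pd

# Made-up dollar values; "ASSET" is an assumed column name for total assets.
df = pd.DataFrame(
    {
        "HOUSES": [0, 100, 200, 300, 400],
        "ASSET": [50, 180, 260, 500, 520],
    }
)

# Pearson correlation between total assets and home value.
asset_house_corr = df["ASSET"].corr(df["HOUSES"])
print("Asset/home value correlation:", round(asset_house_corr, 3))
```

Running the same call on df_fear gives the figure for the credit-fearful subset, and df[cols].corr() on a list of columns gives a full correlation matrix.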
That’s a moderate positive correlation, which we would probably expect, right? For many Americans, the value of their primary residence makes up most of the value of their total assets. What about the people in our TURNFEAR subset? Let’s see if there’s a difference in correlation.
ASSETS VS HOME VALUE: CREDIT FEARFUL
EDUCATION
First, let’s start with education levels “EDUC”, comparing credit fearful and non-fearful groups.
EDUC Highest completed grade by reference person
1. 1st, 2nd, 3rd, or 4th grade
2. 5th or 6th grade
3. 7th and 8th grade
4. 9th grade
5. 10th grade
6. 11th grade
7. 12th grade, no diploma
8. High school graduate – high school diploma or equivalent
9. Some college but no degree
10. Associate degree in college – occupation/vocation program
11. Associate degree in college – academic program
12. Bachelor’s degree (for example BA, AB, BS)
13. Master’s degree (for example MA, MS, MENG, MED, MSW, MBA)
14. Professional school degree (for example MD, DDS, DVM, LLB, JD)
15. Doctorate degree (for example PHD, EDD)
-1. Less than 1st grade
EDUCATION: CREDIT FEARFUL VS. CREDIT FEARLESS
EDUCATION: SIDE BY SIDE BAR CHART
In this plot, we can see that a much higher proportion of credit-fearful respondents have only a high school diploma, while university degrees are more common among the non-credit fearful.
DEBT
ASSETS VS DEBT: WHOLE DATASET
ASSETS VS DEBT: CREDIT FEARFUL
You can see the relationship in our df_fear graph is flatter than the one in our df graph, and the two are clearly different. Let’s end with the most striking difference from our matrices, and make some scatter plots showing the relationship between HOUSES and DEBT.
HOME VALUE VS DEBT: WHOLE DATASET
HOME VALUE VS DEBT: CREDIT FEARFUL
The outliers make it a little difficult to see the difference between these two plots, but the relationship is clear enough: our df_fear graph shows an almost perfect linear relationship, while our df graph shows something a little more muddled. You might also notice that the data points on the df_fear graph form several little groups.
CLUSTERING
IMPORT
SPLIT
We need to split our data, but we are not going to need a target vector or a test set this time around. That’s because the model we’ll be building involves unsupervised learning. It’s called unsupervised because the model does not try to map the input to a set of labels or targets that already exist.
VERTICAL SPLIT
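A sketch of the vertical split: the feature matrix X is just the two columns we want to cluster on, with no target vector. The rows are made up, and whether X is drawn from df or df_fear in your own run is an assumption to check.

```python
import pandas as pd

# Made-up rows standing in for the survey data.
df = pd.DataFrame(
    {
        "DEBT": [12_000, 250_000, 0, 90_000],
        "HOUSES": [0, 400_000, 150_000, 120_000],
        "TURNFEAR": [1, 1, 1, 1],
    }
)

# Feature matrix only: unsupervised learning needs no target vector.
X = df[["DEBT", "HOUSES"]]
print("X shape:", X.shape)
```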
BUILD MODEL
Take a second and run slowly through all the positions on the slider. At the first position, there’s a whole bunch of grey data points; and if you look carefully, you’ll see there are also three stars. Those stars are the centroids. At first, their position is set randomly. If you move the slider one more position to the right, you’ll see all the grey points change to colours that correspond to the three clusters.
Since a centroid represents the mean value of all the data in the cluster, we would expect it to fall in the centre of whatever cluster it’s in. That’s what will happen if you move the slider one more position to the right.
But since the centroids moved, the data points might not be in the right clusters anymore. Move the slider again, and you’ll see the data points redistribute themselves to better reflect the new position of the centroids. The new clusters mean that the centroids also need to move, which will lead to the clusters changing again, and so on, until all the data points end up in the right cluster with a centroid that reflects the mean value of all those points. Let’s see what happens when we try the same with our “DEBT” and “HOUSES” data.
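The assign-then-move loop described above can be sketched in plain NumPy. This is an illustration on made-up 2-D data, not the code behind the slider; starting the centroids at randomly chosen data points is one common choice among several.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: three loose blobs of 20 points each.
blob_centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in blob_centres])

k = 3
# Step 1: start with k randomly chosen data points as centroids.
centroids = X[rng.choice(len(X), size=k, replace=False)]

for iteration in range(100):
    # Step 2: assign every point to its nearest centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 3: move each centroid to the mean of the points assigned to it
    # (an empty cluster keeps its previous centroid).
    new_centroids = np.array(
        [
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ]
    )

    # Stop once the centroids no longer move.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("Stopped after iteration", iteration)
print(centroids)
```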
ITERATE
Now that we’ve had a chance to play around with the process a little bit, let’s get into how to build a model that does the same thing.
BUILD MODEL
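A minimal sketch of the model-building step with scikit-learn's KMeans, fitted on made-up data with three obvious groups. The n_init and random_state values are arbitrary choices made for reproducibility, not settings taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up DEBT/HOUSES values forming three well-separated groups.
rng = np.random.default_rng(0)
centres = np.array([[20_000, 50_000], [150_000, 400_000], [600_000, 1_500_000]])
X = pd.DataFrame(
    np.vstack([c + rng.normal(scale=5_000, size=(30, 2)) for c in centres]),
    columns=["DEBT", "HOUSES"],
)

# Ask for three clusters; random_state fixes the random starting centroids.
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(X)
print("Iterations run:", model.n_iter_)
```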
And there it is, 42 datapoints spread across three clusters. Let’s grab the labels that the model has assigned to the data points so we can start making a new visualization.
EXTRACT CLUSTER LABELS AND PLOT CLUSTERS
Using the labels we just extracted, let’s recreate the scatter plot from before. This time we will colour each point according to the cluster to which the model assigned it.
Nice, each cluster has its own colour. The centroids are still missing, so let’s pull those out.
EXTRACT CENTROIDS AND PLOT CENTROIDS
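Both the labels and the centroids live on the fitted model as the labels_ and cluster_centers_ attributes. The sketch below refits a small model on made-up data so that it runs on its own; the plotting calls are omitted.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up DEBT/HOUSES values forming three well-separated groups.
rng = np.random.default_rng(0)
centres = np.array([[20_000, 50_000], [150_000, 400_000], [600_000, 1_500_000]])
X = pd.DataFrame(
    np.vstack([c + rng.normal(scale=5_000, size=(30, 2)) for c in centres]),
    columns=["DEBT", "HOUSES"],
)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

labels = model.labels_              # one integer cluster id per row of X
centroids = model.cluster_centers_  # array of shape (n_clusters, n_features)

# Wrapping the centroids in a data frame keeps the column names for plotting.
centroids_df = pd.DataFrame(centroids, columns=X.columns)
print(centroids_df)
```

The labels can be passed as the colour argument of a scatter plot, and the rows of centroids_df plotted on top as the stars.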
Let’s add the centroids to the graph.
That looks great.
Even though our graph makes it look like the clusters are correctly assigned, we need a numerical evaluation. The data we are using is pretty clear-cut but if things were a little more muddled, we’d want to run some calculations to make sure we got everything right. There are two metrics that we will use to evaluate our clusters. We will start with inertia, which measures how far the points in each cluster are from their centroid (the sum of the squared distances).
INERTIA
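To make the definition concrete, the sketch below recomputes inertia by hand as the sum of squared distances from each point to its assigned centroid, and compares it with the model's inertia_ attribute. The data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up DEBT/HOUSES values forming three well-separated groups.
rng = np.random.default_rng(0)
centres = np.array([[20_000, 50_000], [150_000, 400_000], [600_000, 1_500_000]])
X = pd.DataFrame(
    np.vstack([c + rng.normal(scale=5_000, size=(30, 2)) for c in centres]),
    columns=["DEBT", "HOUSES"],
)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Look up the centroid assigned to each row, then sum the squared distances.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((X.to_numpy() - assigned_centroids) ** 2).sum()

print("Inertia from model:", model.inertia_)
print("Inertia by hand:   ", manual_inertia)
```

Because the distances are squared dollar amounts, even tight clusters produce very large numbers.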
The “best” inertia is 0, and our score is pretty far from that. That does not necessarily mean that our model is bad. Inertia is a measurement of distance, like the mean absolute error. This means that the unit of measurement for inertia depends on the unit of measurement of our x- and y-axes.
And since “DEBT” and “HOUSES” are measured in tens of millions of dollars, it’s not surprising that inertia is so large.
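However, it would be helpful to have a metric that was easier to interpret, and that’s where silhouette score comes in. Silhouette score compares how close each point is to the other points in its own cluster with how close it is to the points in the nearest other cluster, so it captures how well separated the clusters are. It ranges from -1 (the worst) to 1 (the best), so it is easier to interpret than inertia.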
SILHOUETTE SCORE
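A sketch of the calculation using scikit-learn's silhouette_score, which takes the features and the cluster labels. The data is made up, so the printed value is not the score reported in this write-up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up DEBT/HOUSES values forming three well-separated groups.
rng = np.random.default_rng(0)
centres = np.array([[20_000, 50_000], [150_000, 400_000], [600_000, 1_500_000]])
X = pd.DataFrame(
    np.vstack([c + rng.normal(scale=5_000, size=(30, 2)) for c in centres]),
    columns=["DEBT", "HOUSES"],
)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Mean silhouette coefficient over all points, between -1 and 1.
ss = silhouette_score(X, model.labels_)
print("Silhouette score:", round(ss, 3))
```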
Outstanding! 0.976 is pretty close to 1, so our model has done a good job at identifying 3 clusters that are far away from each other.
It’s important to remember that these performance metrics are the result of the number of clusters we told our model to create. In unsupervised learning, the number of clusters is a hyperparameter that you set before training your model. So, what would happen if we change the number of clusters? Will it lead to better performance? Let’s try.
FINDING THE BEST K
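One way to collect both metrics is a simple loop over candidate values of n_clusters. The range 2 to 6 and the made-up data are choices for this sketch only; the list names inertia_errors and silhouette_scores are our own.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up DEBT/HOUSES values forming three well-separated groups.
rng = np.random.default_rng(0)
centres = np.array([[20_000, 50_000], [150_000, 400_000], [600_000, 1_500_000]])
X = pd.DataFrame(
    np.vstack([c + rng.normal(scale=5_000, size=(30, 2)) for c in centres]),
    columns=["DEBT", "HOUSES"],
)

n_clusters = range(2, 7)
inertia_errors = []
silhouette_scores = []

# Train one model per value of k and record both metrics.
for k in n_clusters:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertia_errors.append(model.inertia_)
    silhouette_scores.append(silhouette_score(X, model.labels_))

for k, inertia, ss in zip(n_clusters, inertia_errors, silhouette_scores):
    print(f"k={k}: inertia={inertia:.3e}, silhouette={ss:.3f}")
```

The two lists can then be plotted against n_clusters as line charts.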
Now that we have both performance metrics for several different settings of n_clusters, let’s make some line plots to see the relationship between the number of clusters in a model and its inertia and silhouette scores.
INERTIA VS. CLUSTER
What we are seeing here is that, as the number of clusters increases, inertia goes down. We could get inertia to zero (0) if we told our model to make 4,623 clusters (the same as the number of observations in X), but those clusters would not be helpful to us. The trick with choosing the right number of clusters is to look for the “bend in the elbow” for this plot. In other words, we want to pick the point where the drop in inertia becomes less dramatic and the line begins to flatten out. In this case, it looks like the sweet spot is 4 or 5. Let’s see what the silhouette score looks like.
Note that, in contrast to our inertia plot, bigger is better. So we are not looking for a “bend in the elbow” but rather a number of clusters for which the silhouette score remains high. We can see that the silhouette score drops drastically beyond 4 clusters.
Given this and what we saw in the inertia plot, it looks like the optimal number of clusters is 4.
BUILD FINAL MODEL
PLOT FINAL CLUSTERS
We can see all four of our clusters, each differentiated from the rest by colour. We are going to make one more visualisation, converting the cluster analysis we just did to something a little more actionable: a side-by-side bar chart. To do that, we need to put our clustered data into a data frame.
SIDE-BY-SIDE BAR CHART: GET CENTROIDS
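A sketch of this step on made-up data with four groups: grouping X by the final model's labels and taking the mean gives one row per cluster. The names final_model and xgb follow the ones used in this write-up; the KMeans settings are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up DEBT/HOUSES values forming four well-separated groups.
rng = np.random.default_rng(1)
centres = np.array(
    [[90_000, 100_000], [500_000, 1_000_000], [250_000, 500_000], [800_000, 2_000_000]]
)
X = pd.DataFrame(
    np.vstack([c + rng.normal(scale=5_000, size=(25, 2)) for c in centres]),
    columns=["DEBT", "HOUSES"],
)
final_model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Mean DEBT and HOUSES per cluster, indexed by cluster label.
xgb = X.groupby(final_model.labels_).mean()
print(xgb)
print(final_model.cluster_centers_)
```

Since each centroid is the mean of the points in its cluster, the two printouts should agree row by row.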
Do you see any similarities between the xgb data frame and the cluster_centers_ values above?
SIDE-BY-SIDE BAR CHART: BUILD CHART
The side-by-side bar chart above, built from xgb, shows the mean “DEBT” and “HOUSES” values for each of the clusters in the final_model.
In this plot, we have our four clusters spread across the x-axis, and the dollar amounts for home value and household debt on the y-axis.
The first thing to look at in this chart is the different mean home values for the four clusters.
Cluster 0 represents households with small to moderate home values, clusters 2 and 3 have high home values, and cluster 1 has extremely high values.
The second thing to look at is the proportion of debt to home value. In clusters 1 and 2, this proportion is around 0.5. This suggests that these groups have a moderate amount of untapped equity in their homes. But for group 0, it’s almost 1, which suggests that the largest source of household debt is their mortgage. Group 3 is unique in that they have the smallest proportion of debt to home value, around 0.4.
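The proportion discussed here is simply mean debt divided by mean home value for each cluster. The numbers below are hypothetical, chosen only to show the arithmetic; they are not the values from the chart.

```python
import pandas as pd

# Hypothetical cluster means, NOT the values from the analysis above.
xgb = pd.DataFrame(
    {"DEBT": [90_000, 500_000], "HOUSES": [100_000, 1_000_000]},
    index=[0, 1],
)

# Debt as a share of home value, one figure per cluster.
debt_to_home = xgb["DEBT"] / xgb["HOUSES"]
print(debt_to_home)
```

A ratio near 1 means debt is almost as large as the home's value, while a ratio near 0.5 leaves roughly half the home's value as equity.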
This information could be useful to a financial institution that wants to target customers with products that would appeal to them. For instance, households in group 0 might be interested in refinancing their mortgage to lower their interest rates while group 3 could be interested in a home equity line of credit.