One of the biggest problems banks used to face was the large number of credit card applications. Applicants could be rejected for having a low income level or high loan balances, and analyzing these applications one by one was a headache. Fortunately, with the power of machine learning this task can now be automated.
First, we want our data in good shape so that our model can make good predictions; consequently, we are going to preprocess the dataset by cleaning it before performing the exploratory data analysis. The goal is to build a machine learning model that can predict whether an individual's application for a credit card will be accepted or not.
We will use the "Credit Approval Data Set" from the University of California, Irvine (UCI) Machine Learning Repository.
The file concerns credit card applications. Because the data is confidential, the contributor has anonymized the feature names and replaced all attribute values with meaningless symbols. The dataset is interesting because it contains a good mix of attributes: continuous, nominal with a small number of values, and nominal with a larger number of values.
import pandas as pd
import numpy as np
df = pd.read_csv("credit.data", header=None)
df.columns = ['Gender', 'Age', 'Debt', 'Married', 'BankCstomer', 'EducationLevel', 'Ethnicity',
              'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore', 'DriversLicense', 'Citizen',
              'ZipCode', 'Income', 'Approved']
df.head()
| | Gender | Age | Debt | Married | BankCstomer | EducationLevel | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | Approved |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.000 | u | g | w | v | 1.25 | t | t | 1 | f | g | 00202 | 0 | + |
| 1 | a | 58.67 | 4.460 | u | g | q | h | 3.04 | t | t | 6 | f | g | 00043 | 560 | + |
| 2 | a | 24.50 | 0.500 | u | g | q | h | 1.50 | t | f | 0 | f | g | 00280 | 824 | + |
| 3 | b | 27.83 | 1.540 | u | g | w | v | 3.75 | t | t | 5 | t | g | 00100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 00120 | 0 | + |
df.describe()
| | Debt | YearsEmployed | CreditScore | Income |
|---|---|---|---|---|
| count | 690.000000 | 690.000000 | 690.00000 | 690.000000 |
| mean | 4.758725 | 2.223406 | 2.40000 | 1017.385507 |
| std | 4.978163 | 3.346513 | 4.86294 | 5210.102598 |
| min | 0.000000 | 0.000000 | 0.00000 | 0.000000 |
| 25% | 1.000000 | 0.165000 | 0.00000 | 0.000000 |
| 50% | 2.750000 | 1.000000 | 0.00000 | 5.000000 |
| 75% | 7.207500 | 2.625000 | 3.00000 | 395.500000 |
| max | 28.000000 | 28.500000 | 67.00000 | 100000.000000 |
The data has some issues that can affect the performance of our model if they go unfixed:
It contains both numeric and non-numeric columns, and the numeric features span very different ranges: for example, Debt goes from 0 to 28, CreditScore from 0 to 67, and Income from 0 to 100000. The dataset also has missing values, labeled with the character '?' rather than a proper NaN.
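As a quick sanity check (a small sketch, not part of the original notebook output), we can count how many '?' placeholders each column contains before replacing anything:

# Sketch: count the '?' placeholders per column before any replacement
question_marks = (df == "?").sum()
print(question_marks[question_marks > 0])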
So, we are going to temporarily replace these '?' characters with NaNs and then impute the missing numeric values with a strategy called mean imputation.
# Replace the '?' placeholders with proper NaNs
df = df.replace("?", np.nan)
# Mean imputation: fill missing values in the numeric columns with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)
df.isnull().sum()
Gender            12
Age               12
Debt               0
Married            6
BankCstomer        6
EducationLevel     9
Ethnicity          9
YearsEmployed      0
PriorDefault       0
Employed           0
CreditScore        0
DriversLicense     0
Citizen            0
ZipCode           13
Income             0
Approved           0
dtype: int64
Replacing is easy for numeric columns; for the non-numeric ones we need to do something different. Similarly, we are going to impute the missing values with the most frequent value present in the respective column.
# For the non-numeric (object) columns, fill missing values with that column's most frequent value
for col in df:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna(df[col].value_counts().index[0])
df.isnull().sum()
Gender            0
Age               0
Debt              0
Married           0
BankCstomer       0
EducationLevel    0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
DriversLicense    0
Citizen           0
ZipCode           0
Income            0
Approved          0
dtype: int64
Before we proceed towards building our model, we still have some work to be done. We need to convert non-numeric columns into numeric, scale the feature values to a uniform range and split the data into train and test sets.
We do this because many machine learning algorithms require the data to be strictly numeric, and it also speeds up computation. The technique we will use to convert the categorical columns is called label encoding.
Finally, we are going to scale our data to the range 0 to 1. For example, the credit score of a person reflects their creditworthiness based on their credit history: the higher this number, the more financially trustworthy the person is considered to be. After rescaling, a CreditScore of 1 is therefore the highest possible value, since all values are mapped to the range from 0 to 1.
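To make that concrete, here is a toy illustration (a sketch with made-up numbers, not part of the dataset) of how MinMaxScaler maps a column's minimum to 0 and its maximum to 1:

# Toy illustration: MinMaxScaler maps each column's minimum to 0 and its maximum to 1,
# so the largest value in a column always becomes exactly 1 after rescaling.
from sklearn.preprocessing import MinMaxScaler
import numpy as np
toy_scores = np.array([[0.0], [5.0], [67.0]])
print(MinMaxScaler(feature_range=(0, 1)).fit_transform(toy_scores).ravel())
# -> approximately [0.0, 0.0746, 1.0]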
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
# Label-encode every non-numeric column (each category is mapped to an integer)
le = LabelEncoder()
for col in df:
    if df[col].dtype == 'object':
        df[col] = le.fit_transform(df[col])
df[df.columns[13]]
0      68
1      11
2      96
3      31
4      37
       ..
685    90
686    67
687    67
688    96
689     0
Name: ZipCode, Length: 690, dtype: int32
# Drop two features that we will not use for modeling, then split into features and target
df = df.drop(['DriversLicense', 'ZipCode'], axis=1)
df = df.values
X, y = df[:, 0:13], df[:, 13]
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.33,
random_state=42)
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
# Fit the scaler on the training set only and reuse it to transform the test set
rescaledX_test = scaler.transform(X_test)
Basically, credit prediction is a classification task. According to the UCI description, this dataset contains more denied applications than approved ones: out of 690 instances, 383 (55.5%) applications were denied and 307 (44.5%) were approved. This gives us a baseline: a model that always predicts denial would already be right 55.5% of the time, so our model should do noticeably better than that.
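As a quick check of that class balance (a small sketch; which 0/1 label corresponds to "approved" depends on LabelEncoder's alphabetical ordering of '+' and '-'), we can count the encoded target values:

# Sketch: inspect the class balance of the encoded target
labels, counts = np.unique(y, return_counts=True)
for label, count in zip(labels, counts):
    print(f"class {int(label)}: {count} applications ({count / len(y):.1%})")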
Subsequently, we want to evaluate the model on the test set with respect to classification accuracy. In the case of predicting credit approvals, it is equally important to check that the model correctly flags as denied the applications that were originally denied. If it performs poorly in this respect, it might end up approving applications that should have been denied. The confusion matrix helps us view the model's performance from both of these angles.
model = LogisticRegression()
model.fit(rescaledX_train, y_train)
LogisticRegression()
y_pred = model.predict(rescaledX_test)
print("Accuracy of logistic regression classifier: ", model.score(rescaledX_test, y_test))
confusion_matrix(y_test, y_pred)
Accuracy of logistic regression classifier: 0.8377192982456141
array([[92, 11], [26, 99]], dtype=int64)
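To make the matrix easier to read, here is a small sketch that unpacks it into its four cells (scikit-learn orders rows and columns by label, so with a 0/1 target the layout is [[TN, FP], [FN, TP]] relative to class 1):

# Unpack the confusion matrix into its four cells, treating class 1 as the "positive" class
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("true negatives: ", tn)
print("false positives:", fp)
print("false negatives:", fn)
print("true positives: ", tp)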
Our model has an accuracy of about 84%, which is good. Let's see if we can do better by performing a grid search over the model's hyperparameters. There are several parameters we could tune, but this time we will focus on tol and max_iter and see which values work best.
# Define the grid of hyperparameter values to search over
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]
param_grid = dict(tol=tol, max_iter=max_iter)
# 5-fold cross-validated grid search over the rescaled dataset
grid_model = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
rescaledX = scaler.fit_transform(X)
grid_result = grid_model.fit(rescaledX, y)
best_score, best_params = grid_result.best_score_, grid_result.best_params_
print("Best: %f using %s" % (best_score, best_params))
Best: 0.850725 using {'max_iter': 100, 'tol': 0.01}
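One way to sanity-check these tuned parameters on data the grid search never optimized over is to refit a model with them on the training split and score it on the held-out test split. This is a sketch, not part of the original analysis:

# Sketch: refit with the best parameters on the training split only
# and evaluate on the held-out test split
best_model = LogisticRegression(**grid_result.best_params_)
best_model.fit(rescaledX_train, y_train)
print("Test accuracy with tuned parameters:", best_model.score(rescaledX_test, y_test))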
Thanks to the technology we have today, we can now automate tedious and time-consuming tasks. Specifically, with this project we tackled one of the problems banks most used to face: reviewing a large number of credit card applications.
By building this credit predictor, we walked through some of the most widely used preprocessing steps, such as scaling, label encoding, and missing value imputation. We built a logistic regression model that can predict whether a person's credit card application will be approved, given some information about that person.
An interesting question you can ask yourself is: which features affect the credit approval decision the most? And are these variables correlated with each other? For this project we relied on our intuition that they are indeed correlated, but finding out which ones matter most would be interesting; a rough starting point is sketched below.
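As that starting point (a sketch only; coefficients of a logistic regression on scaled features are not a full feature-importance analysis), we can rank the features by the magnitude of the coefficients of the model we already trained:

# Sketch: rank features by the absolute value of their logistic regression coefficients.
# The names mirror the columns used above, minus the two dropped ones (DriversLicense, ZipCode).
feature_names = ['Gender', 'Age', 'Debt', 'Married', 'BankCstomer', 'EducationLevel',
                 'Ethnicity', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore',
                 'Citizen', 'Income']
for name, coef in sorted(zip(feature_names, model.coef_[0]), key=lambda pair: -abs(pair[1])):
    print(f"{name}: {coef:+.3f}")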