Tutorial: FICO Explainable Machine Learning Challenge
In this tutorial, we use the dataset from the FICO Explainable Machine Learning Challenge: https://community.fico.com/s/explainable-machine-learning-challenge. The goal is to build a pipeline that combines a binning process and logistic regression to obtain an explainable model, and to compare it against a black-box model using a Gradient Boosting Tree (GBT) estimator.
[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
[2]:
from optbinning import BinningProcess
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
Download the dataset from the link above and load it.
[3]:
df = pd.read_csv("data/FICO_challenge/heloc_dataset_v1.csv", sep=",")
[4]:
variable_names = list(df.columns[1:])
[5]:
X = df[variable_names].values
Transform the dichotomous categorical target variable into a numerical one.
[6]:
y = df.RiskPerformance.values
mask = y == "Bad"
y[mask] = 1
y[~mask] = 0
y = y.astype(int)
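Equivalently, the same encoding can be done in a single step with a boolean comparison; this is merely an alternative to the cell above.
# Alternative one-liner: encode "Bad" as 1 and "Good" as 0.
y = (df["RiskPerformance"] == "Bad").astype(int).values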
Modeling
The data dictionary of this challenge includes three special values/codes:
-9 No Bureau Record or No Investigation
-8 No Usable/Valid Trades or Inquiries
-7 Condition not Met (e.g. No Inquiries, No Delinquencies)
[7]:
special_codes = [-9, -8, -7]
This challenge imposes monotonicity constraints, with respect to the probability of a bad target, on many of the variables. We enforce these constraints by passing the following dictionary of binning parameters for the variables involved.
[8]:
binning_fit_params = {
"ExternalRiskEstimate": {"monotonic_trend": "descending"},
"MSinceOldestTradeOpen": {"monotonic_trend": "descending"},
"MSinceMostRecentTradeOpen": {"monotonic_trend": "descending"},
"AverageMInFile": {"monotonic_trend": "descending"},
"NumSatisfactoryTrades": {"monotonic_trend": "descending"},
"NumTrades60Ever2DerogPubRec": {"monotonic_trend": "ascending"},
"NumTrades90Ever2DerogPubRec": {"monotonic_trend": "ascending"},
"PercentTradesNeverDelq": {"monotonic_trend": "descending"},
"MSinceMostRecentDelq": {"monotonic_trend": "descending"},
"NumTradesOpeninLast12M": {"monotonic_trend": "ascending"},
"MSinceMostRecentInqexcl7days": {"monotonic_trend": "descending"},
"NumInqLast6M": {"monotonic_trend": "ascending"},
"NumInqLast6Mexcl7days": {"monotonic_trend": "ascending"},
"NetFractionRevolvingBurden": {"monotonic_trend": "ascending"},
"NetFractionInstallBurden": {"monotonic_trend": "ascending"},
"NumBank2NatlTradesWHighUtilization": {"monotonic_trend": "ascending"}
}
Instantiate a BinningProcess object with the variable names, the special codes and the dictionary of binning parameters. Then create an explainable model pipeline (binning process + logistic regression) and, for comparison, a plain logistic regression and a black-box GBT classifier.
[9]:
binning_process = BinningProcess(variable_names, special_codes=special_codes,
binning_fit_params=binning_fit_params)
[10]:
clf1 = Pipeline(steps=[('binning_process', binning_process),
('classifier', LogisticRegression(solver="lbfgs"))])
clf2 = LogisticRegression(solver="lbfgs")
clf3 = GradientBoostingClassifier()
Split the dataset into train and test sets. Fit the classifiers on the training data, then generate classification reports showing the main classification metrics.
[11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[12]:
clf1.fit(X_train, y_train)
/home/gui/projects/github/top/optbinning/optbinning/binning/transformations.py:38: RuntimeWarning: invalid value encountered in log
return np.log((1. / event_rate - 1) * n_event / n_nonevent)
[12]:
Pipeline(steps=[('binning_process',
BinningProcess(binning_fit_params={'AverageMInFile': {'monotonic_trend': 'descending'},
'ExternalRiskEstimate': {'monotonic_trend': 'descending'},
'MSinceMostRecentDelq': {'monotonic_trend': 'descending'},
'MSinceMostRecentInqexcl7days': {'monotonic_trend': 'descending'},
'MSinceMostRecentTradeOpen': {'monotonic_trend': 'descen...
'MaxDelqEver', 'NumTotalTrades',
'NumTradesOpeninLast12M',
'PercentInstallTrades',
'MSinceMostRecentInqexcl7days',
'NumInqLast6M',
'NumInqLast6Mexcl7days',
'NetFractionRevolvingBurden',
'NetFractionInstallBurden',
'NumRevolvingTradesWBalance',
'NumInstallTradesWBalance',
'NumBank2NatlTradesWHighUtilization',
'PercentTradesWBalance'])),
('classifier', LogisticRegression())])
[13]:
clf2.fit(X_train, y_train)
/home/gui/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
[13]:
LogisticRegression()
[14]:
clf3.fit(X_train, y_train)
[14]:
GradientBoostingClassifier()
[15]:
y_pred = clf1.predict(X_test)
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.70 0.66 0.68 1004
1 0.70 0.74 0.72 1088
accuracy 0.70 2092
macro avg 0.70 0.70 0.70 2092
weighted avg 0.70 0.70 0.70 2092
/home/gui/projects/github/top/optbinning/optbinning/binning/transformations.py:38: RuntimeWarning: invalid value encountered in log
return np.log((1. / event_rate - 1) * n_event / n_nonevent)
[16]:
y_pred = clf2.predict(X_test)
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.67 0.66 0.67 1004
1 0.69 0.70 0.70 1088
accuracy 0.68 2092
macro avg 0.68 0.68 0.68 2092
weighted avg 0.68 0.68 0.68 2092
[17]:
y_pred = clf3.predict(X_test)
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.71 0.66 0.68 1004
1 0.70 0.75 0.73 1088
accuracy 0.71 2092
macro avg 0.71 0.70 0.70 2092
weighted avg 0.71 0.71 0.70 2092
Plot the Receiver Operating Characteristic (ROC) curves to evaluate and compare the classifiers' predictions.
[18]:
probs = clf1.predict_proba(X_test)
preds = probs[:,1]
fpr1, tpr1, threshold = roc_curve(y_test, preds)
roc_auc1 = auc(fpr1, tpr1)
probs = clf2.predict_proba(X_test)
preds = probs[:,1]
fpr2, tpr2, threshold = roc_curve(y_test, preds)
roc_auc2 = auc(fpr2, tpr2)
probs = clf3.predict_proba(X_test)
preds = probs[:,1]
fpr3, tpr3, threshold = roc_curve(y_test, preds)
roc_auc3 = auc(fpr3, tpr3)
/home/gui/projects/github/top/optbinning/optbinning/binning/transformations.py:38: RuntimeWarning: invalid value encountered in log
return np.log((1. / event_rate - 1) * n_event / n_nonevent)
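As an optional sanity check, the same AUC values can also be obtained directly with roc_auc_score (imported above), for example:
# Optional cross-check of the trapezoidal AUC values computed above.
for name, clf in [("Binning+LR", clf1), ("LR", clf2), ("GBT", clf3)]:
    print(name, roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))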
[19]:
plt.title('Receiver Operating Characteristic')
plt.plot(fpr1, tpr1, 'b', label='Binning+LR: AUC = {0:.2f}'.format(roc_auc1))
plt.plot(fpr2, tpr2, 'g', label='LR: AUC = {0:.2f}'.format(roc_auc2))
plt.plot(fpr3, tpr3, 'r', label='GBT: AUC = {0:.2f}'.format(roc_auc3))
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1],'k--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
The plot above shows the gain in model performance obtained by applying the binning process before the logistic regression estimator. Furthermore, a prior binning process can reduce numerical instability issues, as evidenced by the convergence warning raised when fitting the plain classifier clf2.
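As the convergence warning itself suggests, scaling the raw inputs is another way to mitigate that instability; a minimal sketch, assuming a standard scaler in front of the plain logistic regression (the name clf2_scaled is only illustrative):
from sklearn.preprocessing import StandardScaler

# Illustrative side experiment: standardize the raw features before fitting
# the plain logistic regression, as suggested by the convergence warning.
clf2_scaled = Pipeline(steps=[('scaler', StandardScaler()),
                              ('classifier', LogisticRegression(solver="lbfgs"))])
clf2_scaled.fit(X_train, y_train)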
Binning process statistics
The binning process of the pipeline can be retrieved to show information about the problem and timing statistics.
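Since the pipeline fits its steps in place, the binning_process object is the same instance held by clf1, so it can equally be retrieved from the fitted pipeline, for example:
# The fitted binning process can also be accessed through the pipeline.
binning_process = clf1.named_steps["binning_process"]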
[20]:
binning_process.information(print_level=2)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Begin options
max_n_prebins 20 * d
min_prebin_size 0.05 * d
min_n_bins no * d
max_n_bins no * d
min_bin_size no * d
max_bin_size no * d
max_pvalue no * d
max_pvalue_policy consecutive * d
selection_criteria no * d
fixed_variables no * d
categorical_variables no * d
special_codes yes * U
split_digits no * d
binning_fit_params yes * U
binning_transform_params no * d
verbose False * d
End options
Statistics
Number of records 8367
Number of variables 23
Target type binary
Number of numerical 23
Number of categorical 0
Number of selected 23
Time 2.3069 sec
The summary method returns basic statistics for each binned variable.
[21]:
binning_process.summary()
[21]:
 | name | dtype | status | selected | n_bins | iv | js | gini | quality_score |
---|---|---|---|---|---|---|---|---|---|
0 | ExternalRiskEstimate | numerical | OPTIMAL | True | 12 | 1.018368 | 0.116638 | 0.534387 | 0.031792 |
1 | MSinceOldestTradeOpen | numerical | OPTIMAL | True | 11 | 0.252786 | 0.030483 | 0.264740 | 0.027179 |
2 | MSinceMostRecentTradeOpen | numerical | OPTIMAL | True | 6 | 0.019086 | 0.002377 | 0.065597 | 0.000556 |
3 | AverageMInFile | numerical | OPTIMAL | True | 10 | 0.319379 | 0.038458 | 0.304157 | 0.128082 |
4 | NumSatisfactoryTrades | numerical | OPTIMAL | True | 10 | 0.126726 | 0.015424 | 0.180888 | 0.001210 |
5 | NumTrades60Ever2DerogPubRec | numerical | OPTIMAL | True | 4 | 0.178710 | 0.021915 | 0.200184 | 0.201631 |
6 | NumTrades90Ever2DerogPubRec | numerical | OPTIMAL | True | 3 | 0.133485 | 0.016301 | 0.155193 | 0.286527 |
7 | PercentTradesNeverDelq | numerical | OPTIMAL | True | 8 | 0.377803 | 0.045428 | 0.316946 | 0.101421 |
8 | MSinceMostRecentDelq | numerical | OPTIMAL | True | 7 | 0.289526 | 0.035246 | 0.272229 | 0.239494 |
9 | MaxDelq2PublicRecLast12M | numerical | OPTIMAL | True | 3 | 0.330280 | 0.040250 | 0.301670 | 0.833712 |
10 | MaxDelqEver | numerical | OPTIMAL | True | 4 | 0.236098 | 0.029129 | 0.257314 | 0.667940 |
11 | NumTotalTrades | numerical | OPTIMAL | True | 8 | 0.064716 | 0.008027 | 0.138545 | 0.011755 |
12 | NumTradesOpeninLast12M | numerical | OPTIMAL | True | 6 | 0.023530 | 0.002936 | 0.083770 | 0.007932 |
13 | PercentInstallTrades | numerical | OPTIMAL | True | 8 | 0.098610 | 0.012107 | 0.159569 | 0.077405 |
14 | MSinceMostRecentInqexcl7days | numerical | OPTIMAL | True | 4 | 0.166538 | 0.020460 | 0.211639 | 0.531041 |
15 | NumInqLast6M | numerical | OPTIMAL | True | 4 | 0.089956 | 0.011127 | 0.159369 | 0.323780 |
16 | NumInqLast6Mexcl7days | numerical | OPTIMAL | True | 5 | 0.083992 | 0.010394 | 0.153641 | 0.036291 |
17 | NetFractionRevolvingBurden | numerical | OPTIMAL | True | 9 | 0.574686 | 0.068232 | 0.410605 | 0.343593 |
18 | NetFractionInstallBurden | numerical | OPTIMAL | True | 5 | 0.037879 | 0.004724 | 0.105916 | 0.053723 |
19 | NumRevolvingTradesWBalance | numerical | OPTIMAL | True | 7 | 0.093376 | 0.011578 | 0.162108 | 0.011291 |
20 | NumInstallTradesWBalance | numerical | OPTIMAL | True | 5 | 0.014121 | 0.001762 | 0.059437 | 0.010423 |
21 | NumBank2NatlTradesWHighUtilization | numerical | OPTIMAL | True | 5 | 0.334853 | 0.041017 | 0.308402 | 0.222126 |
22 | PercentTradesWBalance | numerical | OPTIMAL | True | 12 | 0.365412 | 0.044210 | 0.334112 | 0.018131 |
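The summary is a regular pandas DataFrame, so the usual operations apply; for instance, to rank variables by information value:
# Rank binned variables by information value (IV), highest first.
binning_process.summary().sort_values("iv", ascending=False).head()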
The get_binned_variable method serves to retrieve an optimal binning object, which can be analyzed in detail afterward.
[22]:
optb = binning_process.get_binned_variable("NumBank2NatlTradesWHighUtilization")
[23]:
optb.binning_table.build()
[23]:
 | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 0.50) | 3416 | 0.408271 | 2184 | 1232 | 0.360656 | 0.662217 | 0.175281 | 0.021518 |
1 | [0.50, 1.50) | 2015 | 0.240827 | 858 | 1157 | 0.574194 | -0.209284 | 0.010461 | 0.001305 |
2 | [1.50, 2.50) | 983 | 0.117485 | 345 | 638 | 0.649034 | -0.525096 | 0.031309 | 0.003869 |
3 | [2.50, 3.50) | 496 | 0.059281 | 137 | 359 | 0.723790 | -0.873644 | 0.041802 | 0.005065 |
4 | [3.50, inf) | 521 | 0.062268 | 139 | 382 | 0.733205 | -0.921249 | 0.048466 | 0.005853 |
5 | Special | 936 | 0.111868 | 333 | 603 | 0.644231 | -0.504077 | 0.027533 | 0.003406 |
6 | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals |  | 8367 | 1.000000 | 3996 | 4371 | 0.522409 |  | 0.334853 | 0.041017 |
[24]:
optb.binning_table.plot(metric="event_rate")
[25]:
optb.binning_table.analysis()
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------
General metrics
Gini index 0.30840174
IV (Jeffrey) 0.33485338
JS (Jensen-Shannon) 0.04101657
Hellinger 0.04143062
Triangular 0.16088993
KS 0.26468885
HHI 0.25839135
HHI (normalized) 0.13478991
Cramer's V 0.28612357
Quality score 0.22212566
Monotonic trend ascending
Significance tests
Bin A Bin B t-statistic p-value P[A > B] P[B > A]
0 1 234.556206 6.049946e-53 5.576352e-107 1.000000
1 2 15.402738 8.686234e-05 7.665294e-06 0.999992
2 3 8.386138 3.780934e-03 1.290940e-03 0.998709
3 4 0.113909 7.357368e-01 3.678894e-01 0.632111
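Finally, the fitted binning process can also be used on its own to transform new data; for a binary target, transform returns Weight of Evidence (WoE) values by default, which is what the pipeline feeds into the logistic regression. A minimal sketch:
# Transform the test set with the fitted binning process (WoE by default for a
# binary target); these are the features the logistic regression actually sees.
X_test_woe = binning_process.transform(X_test)
print(X_test_woe.shape)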