Tutorial: FICO Explainable Machine Learning Challenge

In this tutorial, we use the dataset from the FICO Explainable Machine Learning Challenge: https://community.fico.com/s/explainable-machine-learning-challenge. The goal is to build a pipeline that combines a binning process with logistic regression to obtain an explainable model, and to compare it against a black-box model that uses a Gradient Boosting Tree (GBT) as the estimator.

[1]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

[2]:

from optbinning import BinningProcess

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


[3]:

df = pd.read_csv("data/FICO_challenge/heloc_dataset_v1.csv", sep=",")

[4]:

variable_names = list(df.columns[1:])

[5]:

X = df[variable_names].values


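Transform the categorical binary target variable into a numerical one ("Bad" becomes 1, "Good" becomes 0).

[6]:

# RiskPerformance holds the strings "Good" and "Bad"; encode "Bad" as the event (1).
y = (df.RiskPerformance.values == "Bad").astype(int)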


Modeling

The data dictionary of this challenge includes three special values/codes:

• -9 No Bureau Record or No Investigation

• -8 No Usable/Valid Trades or Inquiries

• -7 Condition not Met (e.g. No Inquiries, No Delinquencies)

[7]:

special_codes = [-9, -8, -7]
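As a rough illustration of what treating these codes as special means, the sketch below separates the special codes from the regular values, so that they can be grouped on their own (compare the Special row of the binning table shown later in this tutorial). The helper `split_special` and the sample values are hypothetical and are not part of optbinning.

```python
def split_special(values, special_codes):
    """Split values into (regular, special) lists, preserving order."""
    codes = set(special_codes)
    regular = [v for v in values if v not in codes]
    special = [v for v in values if v in codes]
    return regular, special


special_codes = [-9, -8, -7]
sample = [55, -9, 61, 67, -8, 66, -7, 81]  # made-up feature values

regular, special = split_special(sample, special_codes)
print(regular)  # [55, 61, 67, 66, 81]
print(special)  # [-9, -8, -7]
```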


This challenge imposes monotonicity constraints with respect to the probability of a bad target for many of the variables. We apply these rules by passing the following dictionary of binning parameters for the variables involved.

[8]:

binning_fit_params = {
    "ExternalRiskEstimate": {"monotonic_trend": "descending"},
    "AverageMInFile": {"monotonic_trend": "descending"},
    "MSinceMostRecentDelq": {"monotonic_trend": "descending"},
    "MSinceMostRecentInqexcl7days": {"monotonic_trend": "descending"},
    "NumInqLast6M": {"monotonic_trend": "ascending"},
    "NumInqLast6Mexcl7days": {"monotonic_trend": "ascending"},
    "NetFractionRevolvingBurden": {"monotonic_trend": "ascending"},
    "NetFractionInstallBurden": {"monotonic_trend": "ascending"},
}
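To make the constraint concrete, here is a minimal sketch that checks whether a sequence of per-bin event rates follows a given trend. The helper `is_monotonic` is hypothetical and only illustrates the idea; it is not how optbinning enforces the constraint internally.

```python
def is_monotonic(rates, trend):
    """Return True if consecutive rates follow the requested trend."""
    pairs = list(zip(rates, rates[1:]))
    if trend == "ascending":
        return all(a <= b for a, b in pairs)
    if trend == "descending":
        return all(a >= b for a, b in pairs)
    raise ValueError("trend must be 'ascending' or 'descending'")


# Event rates of the numerical bins of NumBank2NatlTradesWHighUtilization,
# copied from the binning table shown later in this tutorial.
event_rates = [0.360656, 0.574194, 0.649034, 0.723790, 0.733205]

print(is_monotonic(event_rates, "ascending"))   # True
print(is_monotonic(event_rates, "descending"))  # False
```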


Instantiate a BinningProcess object with the variable names, special codes and dictionary of binning parameters. Create an explainable model pipeline, a plain logistic regression baseline and a black-box GBT model.

[9]:

binning_process = BinningProcess(variable_names, special_codes=special_codes,
                                 binning_fit_params=binning_fit_params)

[10]:

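from sklearn.ensemble import GradientBoostingClassifier

clf1 = Pipeline(steps=[('binning_process', binning_process),
                       ('classifier', LogisticRegression(solver="lbfgs"))])
clf2 = LogisticRegression(solver="lbfgs")  # baseline on the raw features
clf3 = GradientBoostingClassifier()  # black-box model fitted below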



Split the dataset into training and test sets. Fit the models on the training data, then generate classification reports to show the main classification metrics.

[11]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[12]:

clf1.fit(X_train, y_train)

/home/gui/projects/github/top/optbinning/optbinning/binning/transformations.py:38: RuntimeWarning: invalid value encountered in log
return np.log((1. / event_rate - 1) * n_event / n_nonevent)

[12]:

Pipeline(steps=[('binning_process',
BinningProcess(binning_fit_params={'AverageMInFile': {'monotonic_trend': 'descending'},
'ExternalRiskEstimate': {'monotonic_trend': 'descending'},
'MSinceMostRecentDelq': {'monotonic_trend': 'descending'},
'MSinceMostRecentInqexcl7days': {'monotonic_trend': 'descending'},
'MSinceMostRecentInqexcl7days',
'NumInqLast6M',
'NumInqLast6Mexcl7days',
'NetFractionRevolvingBurden',
'NetFractionInstallBurden',
('classifier', LogisticRegression())])

[13]:

clf2.fit(X_train, y_train)

/home/gui/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,

[13]:

LogisticRegression()

[14]:

clf3.fit(X_train, y_train)

[14]:

GradientBoostingClassifier()

[15]:

y_pred = clf1.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.70      0.66      0.68      1004
           1       0.70      0.74      0.72      1088

    accuracy                           0.70      2092
   macro avg       0.70      0.70      0.70      2092
weighted avg       0.70      0.70      0.70      2092


/home/gui/projects/github/top/optbinning/optbinning/binning/transformations.py:38: RuntimeWarning: invalid value encountered in log
return np.log((1. / event_rate - 1) * n_event / n_nonevent)

[16]:

y_pred = clf2.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.67      0.66      0.67      1004
           1       0.69      0.70      0.70      1088

    accuracy                           0.68      2092
   macro avg       0.68      0.68      0.68      2092
weighted avg       0.68      0.68      0.68      2092


[17]:

y_pred = clf3.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.71      0.66      0.68      1004
           1       0.70      0.75      0.73      1088

    accuracy                           0.71      2092
   macro avg       0.71      0.70      0.70      2092
weighted avg       0.71      0.71      0.70      2092



Plot the Receiver Operating Characteristic (ROC) curve to evaluate and compare the classifiers' predictions.

[18]:

probs = clf1.predict_proba(X_test)
preds = probs[:,1]
fpr1, tpr1, threshold = roc_curve(y_test, preds)
roc_auc1 = auc(fpr1, tpr1)

probs = clf2.predict_proba(X_test)
preds = probs[:,1]
fpr2, tpr2, threshold = roc_curve(y_test, preds)
roc_auc2 = auc(fpr2, tpr2)

probs = clf3.predict_proba(X_test)
preds = probs[:,1]
fpr3, tpr3, threshold = roc_curve(y_test, preds)
roc_auc3 = auc(fpr3, tpr3)

/home/gui/projects/github/top/optbinning/optbinning/binning/transformations.py:38: RuntimeWarning: invalid value encountered in log
return np.log((1. / event_rate - 1) * n_event / n_nonevent)

[19]:

plt.title('Receiver Operating Characteristic')
plt.plot(fpr1, tpr1, 'b', label='Binning+LR: AUC = {0:.2f}'.format(roc_auc1))
plt.plot(fpr2, tpr2, 'g', label='LR: AUC = {0:.2f}'.format(roc_auc2))
plt.plot(fpr3, tpr3, 'r', label='GBT: AUC = {0:.2f}'.format(roc_auc3))
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1],'k--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()


The plot above shows that binning improves the performance of the logistic regression estimator. Furthermore, applying a binning process beforehand can reduce numerical instability: clf2, fitted on the raw features, raised a convergence warning, whereas clf1 did not.
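For readers unfamiliar with how the ROC points and the AUC value are obtained, the following is a minimal plain-Python sketch of the idea behind roc_curve and auc. The helpers `roc_points` and `trapezoid_auc` and the toy data are hypothetical; the sketch assumes there are no tied scores and is not scikit-learn's implementation.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the decision threshold
    one sample at a time. Assumes there are no tied scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(fpr, fpr[1:], tpr, tpr[1:]))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities
fpr, tpr = roc_points(y_true, scores)
print(trapezoid_auc(fpr, tpr))  # 0.75
```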

Binning process statistics

The binning process of the pipeline can be retrieved to show information about the problem and timing statistics.

[20]:

binning_process.information(print_level=2)

optbinning (Version 0.19.0)

Begin options
max_n_prebins                         20   * d
min_prebin_size                     0.05   * d
min_n_bins                            no   * d
max_n_bins                            no   * d
min_bin_size                          no   * d
max_bin_size                          no   * d
max_pvalue                            no   * d
max_pvalue_policy            consecutive   * d
selection_criteria                    no   * d
fixed_variables                       no   * d
categorical_variables                 no   * d
special_codes                        yes   * U
split_digits                          no   * d
binning_fit_params                   yes   * U
binning_transform_params              no   * d
verbose                            False   * d
End options

Statistics
Number of records                   8367
Number of variables                   23
Target type                       binary

Number of numerical                   23
Number of categorical                  0
Number of selected                    23

Time                                2.3069 sec



The summary method returns basic statistics for each binned variable.

[21]:

binning_process.summary()

[21]:

    name                                dtype      status   selected  n_bins        iv        js      gini  quality_score
 0  ExternalRiskEstimate                numerical  OPTIMAL  True          12  1.018368  0.116638  0.534387       0.031792
 1  MSinceOldestTradeOpen               numerical  OPTIMAL  True          11  0.252786  0.030483  0.264740       0.027179
 2  MSinceMostRecentTradeOpen           numerical  OPTIMAL  True           6  0.019086  0.002377  0.065597       0.000556
 3  AverageMInFile                      numerical  OPTIMAL  True          10  0.319379  0.038458  0.304157       0.128082
 4  NumSatisfactoryTrades               numerical  OPTIMAL  True          10  0.126726  0.015424  0.180888       0.001210
 5  NumTrades60Ever2DerogPubRec         numerical  OPTIMAL  True           4  0.178710  0.021915  0.200184       0.201631
 6  NumTrades90Ever2DerogPubRec         numerical  OPTIMAL  True           3  0.133485  0.016301  0.155193       0.286527
 7  PercentTradesNeverDelq              numerical  OPTIMAL  True           8  0.377803  0.045428  0.316946       0.101421
 8  MSinceMostRecentDelq                numerical  OPTIMAL  True           7  0.289526  0.035246  0.272229       0.239494
 9  MaxDelq2PublicRecLast12M            numerical  OPTIMAL  True           3  0.330280  0.040250  0.301670       0.833712
10  MaxDelqEver                         numerical  OPTIMAL  True           4  0.236098  0.029129  0.257314       0.667940
11  NumTotalTrades                      numerical  OPTIMAL  True           8  0.064716  0.008027  0.138545       0.011755
12  NumTradesOpeninLast12M              numerical  OPTIMAL  True           6  0.023530  0.002936  0.083770       0.007932
13  PercentInstallTrades                numerical  OPTIMAL  True           8  0.098610  0.012107  0.159569       0.077405
14  MSinceMostRecentInqexcl7days        numerical  OPTIMAL  True           4  0.166538  0.020460  0.211639       0.531041
15  NumInqLast6M                        numerical  OPTIMAL  True           4  0.089956  0.011127  0.159369       0.323780
16  NumInqLast6Mexcl7days               numerical  OPTIMAL  True           5  0.083992  0.010394  0.153641       0.036291
17  NetFractionRevolvingBurden          numerical  OPTIMAL  True           9  0.574686  0.068232  0.410605       0.343593
18  NetFractionInstallBurden            numerical  OPTIMAL  True           5  0.037879  0.004724  0.105916       0.053723
19  NumRevolvingTradesWBalance          numerical  OPTIMAL  True           7  0.093376  0.011578  0.162108       0.011291
20  NumInstallTradesWBalance            numerical  OPTIMAL  True           5  0.014121  0.001762  0.059437       0.010423
21  NumBank2NatlTradesWHighUtilization  numerical  OPTIMAL  True           5  0.334853  0.041017  0.308402       0.222126
22  PercentTradesWBalance               numerical  OPTIMAL  True          12  0.365412  0.044210  0.334112       0.018131
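The iv column of the summary is commonly used to rank variables by predictive power. As a small sketch of that idea, the snippet below ranks a handful of variables by their IV, copied from the summary table above, and keeps those above a cut-off. The threshold `MIN_IV` is a hypothetical value chosen for illustration only; this tutorial keeps all 23 variables.

```python
# IV values copied from a few rows of the summary table above.
iv_by_variable = {
    "ExternalRiskEstimate": 1.018368,
    "NetFractionRevolvingBurden": 0.574686,
    "PercentTradesNeverDelq": 0.377803,
    "MSinceMostRecentTradeOpen": 0.019086,
    "NumInstallTradesWBalance": 0.014121,
}

MIN_IV = 0.02  # hypothetical cut-off, not a value used in this tutorial

# Rank variables from highest to lowest IV, then apply the cut-off.
ranked = sorted(iv_by_variable.items(), key=lambda kv: kv[1], reverse=True)
kept = [name for name, iv in ranked if iv >= MIN_IV]
print(kept)
```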

The get_binned_variable method serves to retrieve an optimal binning object, which can be analyzed in detail afterward.

[22]:

optb = binning_process.get_binned_variable("NumBank2NatlTradesWHighUtilization")

[23]:

optb.binning_table.build()

[23]:

        Bin           Count  Count (%)  Non-event  Event  Event rate        WoE        IV        JS
0       (-inf, 0.50)   3416   0.408271       2184   1232    0.360656   0.662217  0.175281  0.021518
1       [0.50, 1.50)   2015   0.240827        858   1157    0.574194  -0.209284  0.010461  0.001305
2       [1.50, 2.50)    983   0.117485        345    638    0.649034  -0.525096  0.031309  0.003869
3       [2.50, 3.50)    496   0.059281        137    359    0.723790  -0.873644  0.041802  0.005065
4       [3.50, inf)     521   0.062268        139    382    0.733205  -0.921249  0.048466  0.005853
5       Special         936   0.111868        333    603    0.644231  -0.504077  0.027533  0.003406
6       Missing           0   0.000000          0      0    0.000000   0.000000  0.000000  0.000000
Totals                 8367   1.000000       3996   4371    0.522409             0.334853  0.041017

[24]:

optb.binning_table.plot(metric="event_rate")
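The WoE and IV columns of the binning table above can be reproduced from the Non-event and Event counts. The sketch below assumes the usual definitions: WoE is the logarithm of the ratio between a bin's share of non-events and its share of events, and a bin's IV contribution is the difference of those shares multiplied by the WoE. The results should agree with the table to about four decimal places.

```python
import math

# Non-event and event counts per bin, copied from the binning table above
# (five numerical bins followed by the Special bin; the Missing bin is left
# out because it contains no records).
non_events = [2184, 858, 345, 137, 139, 333]
events = [1232, 1157, 638, 359, 382, 603]

total_non_event = sum(non_events)  # 3996
total_event = sum(events)          # 4371

woe, iv = [], []
for ne, e in zip(non_events, events):
    p_non_event = ne / total_non_event  # bin's share of all non-events
    p_event = e / total_event           # bin's share of all events
    w = math.log(p_non_event / p_event)
    woe.append(w)
    iv.append((p_non_event - p_event) * w)

print([round(w, 6) for w in woe])
print(round(sum(iv), 6))
```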

[25]:

optb.binning_table.analysis()

---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------

General metrics

Gini index               0.30840174
IV (Jeffrey)             0.33485338
JS (Jensen-Shannon)      0.04101657
Hellinger                0.04143062
Triangular               0.16088993
KS                       0.26468885
HHI                      0.25839135
HHI (normalized)         0.13478991
Cramer's V               0.28612357
Quality score            0.22212566

Monotonic trend             ascending

Significance tests

Bin A  Bin B  t-statistic      p-value      P[A > B]  P[B > A]
0      1   234.556206 6.049946e-53 5.576352e-107  1.000000
1      2    15.402738 8.686234e-05  7.665294e-06  0.999992
2      3     8.386138 3.780934e-03  1.290940e-03  0.998709
3      4     0.113909 7.357368e-01  3.678894e-01  0.632111
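
The HHI figures in the analysis output above measure how concentrated the records are across bins. The sketch below assumes the usual Herfindahl-Hirschman definition (sum of squared bin shares) and a normalization over all seven rows of the binning table, including the empty Missing bin. These assumptions reproduce the reported values closely, but they are not taken from the optbinning source.

```python
# Record counts per bin from the binning table (including the empty Missing bin).
counts = [3416, 2015, 983, 496, 521, 936, 0]
total = sum(counts)  # 8367

# HHI: sum of squared shares of records per bin.
shares = [c / total for c in counts]
hhi = sum(s * s for s in shares)

# Normalized HHI rescales the index to [0, 1] given the number of bins.
n_bins = len(counts)
hhi_normalized = (hhi - 1 / n_bins) / (1 - 1 / n_bins)

print(round(hhi, 6), round(hhi_normalized, 6))
```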