Tutorial: FICO Explainable Machine Learning Challenge - Updating Binning

This tutorial extends the previous FICO dataset tutorial by replacing the usual binning of a few variables with a piecewise continuous binning. The piecewise continuous binning uses a gradient boosting tree (GBT) classifier as an estimator.

[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
[2]:
from lightgbm import LGBMClassifier

from optbinning import BinningProcess
from optbinning import OptimalPWBinning

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

Download the dataset from the following link and load it: https://community.fico.com/s/explainable-machine-learning-challenge.

[3]:
df = pd.read_csv("data/FICO_challenge/heloc_dataset_v1.csv", sep=",")

variable_names = list(df.columns[1:])

X = df[variable_names]

Transform the categorical binary target variable into a numerical one: "Bad" becomes 1 and any other value becomes 0.

[4]:
y = df.RiskPerformance.values
mask = y == "Bad"
y[mask] = 1
y[~mask] = 0
y = y.astype(int)
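The same encoding can also be written without in-place assignment. The following is a minimal sketch on a small made-up array standing in for `df.RiskPerformance.values`; the name `y_toy` and the sample labels are illustrative only.

```python
import numpy as np

# Toy labels standing in for df.RiskPerformance.values (illustrative data).
labels = np.array(["Bad", "Good", "Good", "Bad"], dtype=object)

# The comparison yields a boolean mask; casting it gives 1 for "Bad", 0 otherwise.
y_toy = (labels == "Bad").astype(int)
print(y_toy)
```

This avoids mutating the array obtained from the DataFrame, which may matter if `df` is reused later.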

Modeling

The data dictionary of this challenge includes three special values/codes:

  • -9 No Bureau Record or No Investigation

  • -8 No Usable/Valid Trades or Inquiries

  • -7 Condition not Met (e.g. No Inquiries, No Delinquencies)

[5]:
special_codes = [-9, -8, -7]
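These codes are markers rather than ordinary measurements, which is why they are declared separately. A minimal sketch with made-up values shows how such codes can be told apart from regular observations; it only illustrates the idea and does not claim to reproduce how optbinning treats them internally.

```python
import numpy as np

special_codes = [-9, -8, -7]

# Made-up column mixing regular values and special codes (illustrative data).
values = np.array([55, -9, 61, -7, 80, -8, 72])

# Boolean mask of the entries that are special codes.
is_special = np.isin(values, special_codes)

# Regular observations, i.e. the ones a numerical binning would work on.
clean_values = values[~is_special]
print(is_special.sum(), clean_values)
```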

This challenge imposes monotonicity constraints with respect to the probability of a bad target for many of the variables. We apply these rules by passing the following dictionary of binning parameters for the variables involved.

[6]:
binning_fit_params = {
    "ExternalRiskEstimate": {"monotonic_trend": "descending"},
    "MSinceOldestTradeOpen": {"monotonic_trend": "descending"},
    "MSinceMostRecentTradeOpen": {"monotonic_trend": "descending"},
    "AverageMInFile": {"monotonic_trend": "descending"},
    "NumSatisfactoryTrades": {"monotonic_trend": "descending"},
    "NumTrades60Ever2DerogPubRec": {"monotonic_trend": "ascending"},
    "NumTrades90Ever2DerogPubRec": {"monotonic_trend": "ascending"},
    "PercentTradesNeverDelq": {"monotonic_trend": "descending"},
    "MSinceMostRecentDelq": {"monotonic_trend": "descending"},
    "NumTradesOpeninLast12M": {"monotonic_trend": "ascending"},
    "MSinceMostRecentInqexcl7days": {"monotonic_trend": "descending"},
    "NumInqLast6M": {"monotonic_trend": "ascending"},
    "NumInqLast6Mexcl7days": {"monotonic_trend": "ascending"},
    "NetFractionRevolvingBurden": {"monotonic_trend": "ascending"},
    "NetFractionInstallBurden": {"monotonic_trend": "ascending"},
    "NumBank2NatlTradesWHighUtilization": {"monotonic_trend": "ascending"}
}
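Since every entry only sets `monotonic_trend`, the same dictionary can be assembled from two lists of variable names. This is a sketch of an equivalent construction; the names `descending`, `ascending` and `params` are introduced here for illustration.

```python
# Variables whose probability of a bad target should decrease as they grow.
descending = [
    "ExternalRiskEstimate", "MSinceOldestTradeOpen",
    "MSinceMostRecentTradeOpen", "AverageMInFile",
    "NumSatisfactoryTrades", "PercentTradesNeverDelq",
    "MSinceMostRecentDelq", "MSinceMostRecentInqexcl7days",
]

# Variables whose probability of a bad target should increase as they grow.
ascending = [
    "NumTrades60Ever2DerogPubRec", "NumTrades90Ever2DerogPubRec",
    "NumTradesOpeninLast12M", "NumInqLast6M", "NumInqLast6Mexcl7days",
    "NetFractionRevolvingBurden", "NetFractionInstallBurden",
    "NumBank2NatlTradesWHighUtilization",
]

# Build one {"monotonic_trend": ...} entry per variable.
params = {name: {"monotonic_trend": "descending"} for name in descending}
params.update({name: {"monotonic_trend": "ascending"} for name in ascending})
print(len(params))
```

Grouping the names by trend makes it easier to check the constraints against the data dictionary of the challenge.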

Instantiate a BinningProcess object with the variable names, the special codes and the dictionary of binning parameters. Choose logistic regression as the classifier.

[7]:
binning_process = BinningProcess(variable_names, special_codes=special_codes,
                                 binning_fit_params=binning_fit_params)
[8]:
clf = LogisticRegression(solver="lbfgs")

Split the dataset into train and test sets, then fit the binning process with the training data. The classification report showing the main classification metrics is generated in the Performance section.

[9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[10]:
binning_process.fit(X_train, y_train)
[10]:
BinningProcess(binning_fit_params={'AverageMInFile': {'monotonic_trend': 'descending'},
                                   'ExternalRiskEstimate': {'monotonic_trend': 'descending'},
                                   'MSinceMostRecentDelq': {'monotonic_trend': 'descending'},
                                   'MSinceMostRecentInqexcl7days': {'monotonic_trend': 'descending'},
                                   'MSinceMostRecentTradeOpen': {'monotonic_trend': 'descending'},
                                   'MSinceOldestTradeOpen': {'mo...
                               'MaxDelq2PublicRecLast12M', 'MaxDelqEver',
                               'NumTotalTrades', 'NumTradesOpeninLast12M',
                               'PercentInstallTrades',
                               'MSinceMostRecentInqexcl7days', 'NumInqLast6M',
                               'NumInqLast6Mexcl7days',
                               'NetFractionRevolvingBurden',
                               'NetFractionInstallBurden',
                               'NumRevolvingTradesWBalance',
                               'NumInstallTradesWBalance',
                               'NumBank2NatlTradesWHighUtilization',
                               'PercentTradesWBalance'])

Now, we replace the usual binning of a few numerical variables with a piecewise continuous binning. Since version 0.9.2, the binning process includes the method update_binned_variable, which allows updating an optimal binning without re-processing the rest of the variables.

[11]:
update_variables = ["ExternalRiskEstimate", "MSinceOldestTradeOpen", "PercentTradesWBalance"]

for variable in update_variables:
    optb = OptimalPWBinning(estimator=LGBMClassifier(),
                            name=variable, objective="l1")
    optb.fit(X_train[variable], y_train, lb=0.001, ub=0.999)
    binning_process.update_binned_variable(name=variable, optb=optb)

Performance

[12]:
clf.fit(binning_process.transform(X_train), y_train)
/home/gui/projects/github/top/optbinning/optbinning/binning/transformations.py:38: RuntimeWarning: invalid value encountered in log
  return np.log((1. / event_rate - 1) * n_event / n_nonevent)
[12]:
LogisticRegression()
[13]:
y_pred = clf.predict(binning_process.transform(X_test))
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.71      0.67      0.69      1004
           1       0.71      0.74      0.73      1088

    accuracy                           0.71      2092
   macro avg       0.71      0.71      0.71      2092
weighted avg       0.71      0.71      0.71      2092

/home/gui/projects/github/top/optbinning/optbinning/binning/transformations.py:38: RuntimeWarning: invalid value encountered in log
  return np.log((1. / event_rate - 1) * n_event / n_nonevent)

Compared with the results from the previous tutorial, we observe a slight improvement in all three metrics (precision, recall and F1-score).
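The runtime warning printed with the outputs quotes the expression `np.log((1. / event_rate - 1) * n_event / n_nonevent)`, which reads as a conversion from an event rate to a weight-of-evidence value. The sketch below reproduces that expression for single floats, using the event and non-event totals reported in the ExternalRiskEstimate binning table of this tutorial. The helper names `event_rate_to_woe` and `clip_rate` are hypothetical. That the `lb=0.001, ub=0.999` arguments play a similar clipping role is an assumption, and the source does not state what triggers the warning.

```python
import math

def event_rate_to_woe(event_rate, n_event, n_nonevent):
    """Same expression as in the warning message, for one float."""
    return math.log((1.0 / event_rate - 1.0) * n_event / n_nonevent)

def clip_rate(rate, lb=0.001, ub=0.999):
    """Keep a rate inside [lb, ub] so the logarithm stays defined."""
    return min(max(rate, lb), ub)

# Totals taken from the ExternalRiskEstimate binning table (training set).
n_event, n_nonevent = 4371, 3996

# At the overall event rate the argument of the log is 1, so the value is ~0.
overall_rate = n_event / (n_event + n_nonevent)
woe_overall = event_rate_to_woe(overall_rate, n_event, n_nonevent)

# A rate above the overall one gives a negative value.
woe_high_risk = event_rate_to_woe(0.8, n_event, n_nonevent)

# A rate outside (0, 1) makes the argument non-positive; clipping avoids that.
woe_clipped = event_rate_to_woe(clip_rate(1.02), n_event, n_nonevent)
print(woe_overall, woe_high_risk, woe_clipped)
```

A rate outside the open interval (0, 1) is one way such a warning can arise; whether that is what happens in this notebook is not confirmed here.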

[14]:
probs = clf.predict_proba(binning_process.transform(X_test))
preds = probs[:,1]
fpr1, tpr1, threshold = roc_curve(y_test, preds)
roc_auc1 = auc(fpr1, tpr1)
/home/gui/projects/github/top/optbinning/optbinning/binning/transformations.py:38: RuntimeWarning: invalid value encountered in log
  return np.log((1. / event_rate - 1) * n_event / n_nonevent)
[15]:
plt.title('Receiver Operating Characteristic')
plt.plot(fpr1, tpr1, 'b', label='Binning+LR: AUC = {0:.4f}'.format(roc_auc1))
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1],'k--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
[Figure: Receiver Operating Characteristic curve of the Binning+LR model, True Positive Rate against False Positive Rate]
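The call `auc(fpr1, tpr1)` integrates the ROC curve with the trapezoidal rule. A minimal sketch of that rule in plain Python, on a made-up four-step curve rather than the FICO predictions, may help when reading the AUC value in the legend; `trapezoid_auc` is a name chosen for this example.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a curve given by points (fpr[i], tpr[i]), trapezoidal rule."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2.0
        area += width * mean_height
    return area

# Made-up ROC points (illustrative only).
fpr_toy = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr_toy = [0.0, 0.5, 0.5, 1.0, 1.0]

toy_auc = trapezoid_auc(fpr_toy, tpr_toy)

# The dashed diagonal of the plot corresponds to an area of 0.5.
diagonal_auc = trapezoid_auc([0.0, 1.0], [0.0, 1.0])
print(toy_auc, diagonal_auc)
```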

Finally, let's check the piecewise continuous binning for one of the most important variables.

[16]:
optb = binning_process.get_binned_variable("ExternalRiskEstimate")
optb.binning_table.build()
[16]:
       Bin              Count  Count (%)  Non-event  Event        c0         c1
0      (-inf, 63.50)     2213   0.264491        510   1703  0.799992  -0.000000
1      [63.50, 65.50)     563   0.067288        153    410  4.297026  -0.055071
2      [65.50, 67.50)     516   0.061671        157    359  1.785820  -0.016732
3      [67.50, 70.50)     824   0.098482        307    517  2.741859  -0.030896
4      [70.50, 72.50)     530   0.063344        258    272  2.295900  -0.024570
5      [72.50, 74.50)     530   0.063344        265    265  2.285756  -0.024430
6      [74.50, 77.50)     709   0.084738        423    286  3.568860  -0.041653
7      [77.50, 81.50)     864   0.103263        600    264  1.746868  -0.018144
8      [81.50, 84.50)     602   0.071949        461    141  2.452263  -0.026799
9      [84.50, 87.50)     518   0.061910        437     81  1.635270  -0.017130
10     [87.50, inf)       498   0.059520        425     73  0.136375   0.000000
11     Special              0   0.000000          0      0  0.000000   0.000000
12     Missing              0   0.000000          0      0  0.000000   0.000000
Totals                   8367   1.000000       3996   4371         -          -
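The columns c0 and c1 can be read as the intercept and slope of a line fitted inside each bin, so that the estimated event rate at a value x is c0 + c1 * x. This reading is an interpretation of the table rather than something stated in the text, but it is consistent with the numbers: neighbouring lines meet at the split points up to rounding. The sketch below evaluates the function with the coefficients copied from the table; `event_rate_at` is a hypothetical helper name.

```python
from bisect import bisect_right

# Split points and per-bin coefficients copied from the binning table.
splits = [63.5, 65.5, 67.5, 70.5, 72.5, 74.5, 77.5, 81.5, 84.5, 87.5]
c0 = [0.799992, 4.297026, 1.785820, 2.741859, 2.295900, 2.285756,
      3.568860, 1.746868, 2.452263, 1.635270, 0.136375]
c1 = [-0.0, -0.055071, -0.016732, -0.030896, -0.024570, -0.024430,
      -0.041653, -0.018144, -0.026799, -0.017130, 0.0]

def event_rate_at(x):
    """Evaluate c0 + c1 * x for the bin containing x (bins are [low, high))."""
    i = bisect_right(splits, x)  # a value equal to a split falls in the right bin
    return c0[i] + c1[i] * x

# Largest jump between neighbouring lines at the split points.
max_gap = max(
    abs((c0[i] + c1[i] * s) - (c0[i + 1] + c1[i + 1] * s))
    for i, s in enumerate(splits)
)

print(event_rate_at(60.0), event_rate_at(70.0), event_rate_at(90.0))
print(max_gap)
```

The estimated rate decreases as ExternalRiskEstimate grows, in line with the descending trend imposed on this variable.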
[17]:
optb.binning_table.analysis()
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------

  General metrics

    Gini index               0.49858632
    IV (Jeffrey)             1.02947580
    JS (Jensen-Shannon)      0.11880544
    Hellinger                0.12345593
    Triangular               0.44355312
    KS                       0.40217895
    Avg precision            0.69910615
    Brier score              0.19906996
    HHI                      0.12640619
    HHI (normalized)         0.05360671
    Cramer's V               0.45085066
    Quality score            0.01103754

  Significance tests

    Bin A  Bin B  t-statistic  p-value  P[A > B]  P[B > A]
        0      1     4.273551 0.038710  0.979402  0.020598
        1      2     1.388516 0.238656  0.880875  0.119125
        2      3     6.506519 0.010748  0.995084  0.004916
        3      4    17.292596 0.000032  0.999996  0.000004
        4      5     0.184590 0.667458  0.666430  0.333570
        5      6    11.450667 0.000715  0.999713  0.000287
        6      7    16.377781 0.000052  0.999986  0.000014
        7      8     8.982480 0.002726  0.999028  0.000972
        8      9    10.554598 0.001159  0.999629  0.000371
        9     10     0.186908 0.665503  0.667979  0.332021
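Some of the general metrics can be checked by hand from the binning table. As an example, the sketch below recomputes the HHI values from the Count column, assuming that HHI is the sum of squared bin shares and that the normalized version rescales it by the number of rows. The normalized value quoted in the comment corresponds to counting all 13 rows, including the empty Special and Missing bins.

```python
# Count column of the ExternalRiskEstimate binning table,
# including the empty Special and Missing rows.
counts = [2213, 563, 516, 824, 530, 530, 709, 864, 602, 518, 498, 0, 0]

total = sum(counts)
shares = [c / total for c in counts]

# Herfindahl-Hirschman index: sum of squared shares.
hhi = sum(s * s for s in shares)

# Normalized variant, rescaled so that equal shares would give 0.
n = len(counts)
hhi_norm = (hhi - 1.0 / n) / (1.0 - 1.0 / n)

# Reported values: HHI 0.12640619, HHI (normalized) 0.05360671.
print(total, round(hhi, 8), round(hhi_norm, 8))
```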

[18]:
optb.binning_table.plot(metric="event_rate")
[Figure: binning table plot for ExternalRiskEstimate, event rate per bin]
[19]:
binning_process.summary()
[19]:
    name                                dtype      status   selected  n_bins        iv        js      gini  quality_score
0   ExternalRiskEstimate                numerical  OPTIMAL      True      11  1.029476  0.118805  0.498586       0.011038
1   MSinceOldestTradeOpen               numerical  OPTIMAL      True      10  0.228822  0.028029  0.241497       0.013902
2   MSinceMostRecentTradeOpen           numerical  OPTIMAL      True       6  0.019086  0.002377  0.065597       0.000556
3   AverageMInFile                      numerical  OPTIMAL      True      10  0.319379  0.038458  0.304157       0.128082
4   NumSatisfactoryTrades               numerical  OPTIMAL      True      10  0.126726  0.015424  0.180888       0.001210
5   NumTrades60Ever2DerogPubRec         numerical  OPTIMAL      True       4  0.178710  0.021915  0.200184       0.201631
6   NumTrades90Ever2DerogPubRec         numerical  OPTIMAL      True       3  0.133485  0.016301  0.155193       0.286527
7   PercentTradesNeverDelq              numerical  OPTIMAL      True       8  0.377803  0.045428  0.316946       0.101421
8   MSinceMostRecentDelq                numerical  OPTIMAL      True       7  0.289526  0.035246  0.272229       0.239494
9   MaxDelq2PublicRecLast12M            numerical  OPTIMAL      True       3  0.330280  0.040250  0.301670       0.833712
10  MaxDelqEver                         numerical  OPTIMAL      True       4  0.236098  0.029129  0.257314       0.667940
11  NumTotalTrades                      numerical  OPTIMAL      True       8  0.064716  0.008027  0.138545       0.011755
12  NumTradesOpeninLast12M              numerical  OPTIMAL      True       6  0.023530  0.002936  0.083770       0.007932
13  PercentInstallTrades                numerical  OPTIMAL      True       8  0.098610  0.012107  0.159569       0.077405
14  MSinceMostRecentInqexcl7days        numerical  OPTIMAL      True       4  0.166538  0.020460  0.211639       0.531041
15  NumInqLast6M                        numerical  OPTIMAL      True       4  0.089956  0.011127  0.159369       0.323780
16  NumInqLast6Mexcl7days               numerical  OPTIMAL      True       5  0.083992  0.010394  0.153641       0.036291
17  NetFractionRevolvingBurden          numerical  OPTIMAL      True       9  0.574686  0.068232  0.410605       0.343593
18  NetFractionInstallBurden            numerical  OPTIMAL      True       5  0.037879  0.004724  0.105916       0.053723
19  NumRevolvingTradesWBalance          numerical  OPTIMAL      True       7  0.093376  0.011578  0.162108       0.011291
20  NumInstallTradesWBalance            numerical  OPTIMAL      True       5  0.014121  0.001762  0.059437       0.010423
21  NumBank2NatlTradesWHighUtilization  numerical  OPTIMAL      True       5  0.334853  0.041017  0.308402       0.222126
22  PercentTradesWBalance               numerical  OPTIMAL      True      13  0.379001  0.045810  0.333979       0.018258
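The iv column gives a quick way to see which variables carry the most information. The sketch below ranks a handful of rows copied from the summary; it is a subset chosen for brevity, and the dictionary `iv_by_variable` is an illustrative stand-in for the DataFrame returned by `binning_process.summary()`.

```python
# IV values copied from a subset of the summary rows (not the full table).
iv_by_variable = {
    "ExternalRiskEstimate": 1.029476,
    "MSinceOldestTradeOpen": 0.228822,
    "PercentTradesNeverDelq": 0.377803,
    "NetFractionRevolvingBurden": 0.574686,
    "NumInstallTradesWBalance": 0.014121,
    "PercentTradesWBalance": 0.379001,
}

# Sort variable names by decreasing IV.
ranking = sorted(iv_by_variable, key=iv_by_variable.get, reverse=True)

for name in ranking:
    print(f"{name:<28} {iv_by_variable[name]:.6f}")
```

Among these rows, ExternalRiskEstimate ranks first, which matches its choice as the variable inspected in detail.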