Tutorial: Scorecard with binary target¶

In this tutorial, we use the dataset from the FICO Explainable Machine Learning Challenge: https://community.fico.com/s/explainable-machine-learning-challenge. The goal is to develop a scorecard using the logistic regression as an estimator.

[1]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

[2]:

from sklearn.linear_model import LogisticRegression

from optbinning import BinningProcess
from optbinning import Scorecard
from optbinning.scorecard import plot_auc_roc, plot_cap, plot_ks

Download the dataset from the link above and load it.

[3]:

df = pd.read_csv("data/FICO_challenge/heloc_dataset_v1.csv", sep=",")

[4]:

variable_names = list(df.columns[1:])
X = df[variable_names]

Transform the categorical dichotomic target variable into numerical.

[5]:

target = "RiskPerformance"
y = df[target].values
mask = y == "Bad"
y[mask] = 1
y[~mask] = 0
y = y.astype(int)

[6]:

df.head()

[6]:

	RiskPerformance	ExternalRiskEstimate	MSinceOldestTradeOpen	MSinceMostRecentTradeOpen	AverageMInFile	NumSatisfactoryTrades	NumTrades60Ever2DerogPubRec	NumTrades90Ever2DerogPubRec	PercentTradesNeverDelq	MSinceMostRecentDelq	...	PercentInstallTrades	NumInqLast6M	NumInqLast6Mexcl7days	NetFractionRevolvingBurden	NetFractionInstallBurden	NumRevolvingTradesWBalance	NumInstallTradesWBalance	NumBank2NatlTradesWHighUtilization	PercentTradesWBalance
0	1	55	144	4	84	20	3	0	83	2	...	43	0	0	33	-8	8	1	1	69
1	1	61	58	15	41	2	4	4	100	-7	...	67	0	0	0	-8	0	-8	-8	0
2	1	67	66	5	24	9	0	0	100	-7	...	44	4	4	53	66	4	2	1	86
3	1	66	169	1	73	28	1	1	93	76	...	57	5	4	72	83	6	4	3	91
4	1	81	333	27	132	12	0	0	100	-7	...	25	1	1	51	89	3	1	0	80

5 rows × 24 columns

Scorecard development¶

This dataset includes three special values/codes:

-9 No Bureau Record or No Investigation
-8 No Usable/Valid Trades or Inquiries
-7 Condition not Met (e.g. No Inquiries, No Delinquencies)

[7]:

special_codes = [-9, -8, -7]

We specify a selection criteria in terms of the Information Value (IV) predictiveness and minimum quality score to remove low-quality variables. Then, we instantiate a BinningProcess object class with variable names, special codes and selection criteria.

[8]:

selection_criteria = {
    "iv": {"min": 0.02, "max": 1},
    "quality_score": {"min": 0.01}
}

[9]:

binning_process = BinningProcess(variable_names, special_codes=special_codes,
                                 selection_criteria=selection_criteria)

We select as an estimator a logistic regression to be solved using the non-linear solver L-BFGS-B.

[10]:

estimator = LogisticRegression(solver="lbfgs")

Finally, we instantiate a Scorecard class with the target name, a binning process object, and an estimator. In addition, we want to apply a scaling method to the scorecard points.

[11]:

scorecard = Scorecard(binning_process=binning_process,
                      estimator=estimator, scaling_method="min_max",
                      scaling_method_params={"min": 300, "max": 850})

[12]:

scorecard.fit(X, y, show_digits=4)

/home/gui/projects/github/top/optbinning/optbinning/binning/transformations.py:38: RuntimeWarning: invalid value encountered in log
  return np.log((1. / event_rate - 1) * n_event / n_nonevent)

[12]:

Scorecard(binning_process=BinningProcess(selection_criteria={'iv': {'max': 1,
                                                                    'min': 0.02},
                                                             'quality_score': {'min': 0.01}},
                                         special_codes=[-9, -8, -7],
                                         variable_names=['ExternalRiskEstimate',
                                                         'MSinceOldestTradeOpen',
                                                         'MSinceMostRecentTradeOpen',
                                                         'AverageMInFile',
                                                         'NumSatisfactoryTrades',
                                                         'NumTrades60Ever2DerogPubRec',
                                                         'NumTrades90Ever2DerogPubRec',
                                                         'PercentTradesNe...
                                                         'PercentInstallTrades',
                                                         'MSinceMostRecentInqexcl7days',
                                                         'NumInqLast6M',
                                                         'NumInqLast6Mexcl7days',
                                                         'NetFractionRevolvingBurden',
                                                         'NetFractionInstallBurden',
                                                         'NumRevolvingTradesWBalance',
                                                         'NumInstallTradesWBalance',
                                                         'NumBank2NatlTradesWHighUtilization',
                                                         'PercentTradesWBalance']),
          estimator=LogisticRegression(), scaling_method='min_max',
          scaling_method_params={'max': 850, 'min': 300})

Similar to other objects in OptBinning, we can print overview information about the options settings, problems statistics, and the number of selected variables after the binning process. With these settings, using the selection criteria, 4 variables are removed.

[13]:

scorecard.information(print_level=2)

optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0

  Begin options
    binning_process                      yes   * U
    estimator                            yes   * U
    scaling_method                   min_max   * U
    scaling_method_params                yes   * U
    intercept_based                    False   * d
    reverse_scorecard                  False   * d
    rounding                           False   * d
    verbose                            False   * d
  End options

  Statistics
    Number of records                  10459
    Number of variables                   23
    Target type                       binary

    Number of numerical                   23
    Number of categorical                  0
    Number of selected                    19

  Timing
    Total time                          3.28 sec
    Binning process                     2.62 sec   ( 79.92%)
    Estimator                           0.25 sec   (  7.70%)
    Build scorecard                     0.41 sec   ( 12.36%)
      rounding                          0.00 sec   (  0.00%)

The method table returns the scorecard table. A scorecard table has a wide range of real-world business applications, being an interpretable tool to summarize relationships among variables. The scorecard table can handle binary and continuous targets. Two scorecard styles are available: style="summary" shows the variable name, and their corresponding bins and assigned points; style="detailed" adds information from the corresponding binning table.

[14]:

scorecard.table(style="summary")

[14]:

	Variable	Bin	Points
0	ExternalRiskEstimate	(-inf, 59.5000)	5.359275
1	ExternalRiskEstimate	[59.5000, 63.5000)	11.598078
2	ExternalRiskEstimate	[63.5000, 65.5000)	18.168973
3	ExternalRiskEstimate	[65.5000, 67.5000)	19.821705
4	ExternalRiskEstimate	[67.5000, 70.5000)	25.498720
...	...	...	...
8	PercentTradesWBalance	[80.5000, 87.5000)	32.310289
9	PercentTradesWBalance	[87.5000, 98.0000)	32.026880
10	PercentTradesWBalance	[98.0000, inf)	31.928758
11	PercentTradesWBalance	Special	32.738612
12	PercentTradesWBalance	Missing	32.738612

164 rows × 3 columns

[15]:

scorecard.table(style="detailed")

[15]:

	Variable	Bin id	Bin	Count	Count (%)	Non-event	Event	Event rate	WoE	IV	JS	Coefficient	Points
0	ExternalRiskEstimate	0	(-inf, 59.5000)	1081	0.103356	166	915	0.846438	-1.619109	0.217629	0.024574	-0.327969	5.359275
1	ExternalRiskEstimate	1	[59.5000, 63.5000)	1097	0.104886	228	869	0.792160	-1.250170	0.142003	0.016678	-0.327969	11.598078
2	ExternalRiskEstimate	2	[63.5000, 65.5000)	681	0.065111	190	491	0.720999	-0.861592	0.044754	0.005427	-0.327969	18.168973
3	ExternalRiskEstimate	3	[65.5000, 67.5000)	652	0.062339	195	457	0.700920	-0.763856	0.034156	0.004169	-0.327969	19.821705
4	ExternalRiskEstimate	4	[67.5000, 70.5000)	1038	0.099245	388	650	0.626204	-0.428139	0.017755	0.002203	-0.327969	25.498720
...	...	...	...	...	...	...	...	...	...	...	...	...	...
8	PercentTradesWBalance	8	[80.5000, 87.5000)	797	0.076202	283	514	0.644918	-0.508949	0.019114	0.002364	-0.016322	32.310289
9	PercentTradesWBalance	9	[87.5000, 98.0000)	652	0.062339	184	468	0.717791	-0.845705	0.041380	0.005024	-0.016322	32.026880
10	PercentTradesWBalance	10	[98.0000, inf)	1277	0.122096	331	946	0.740799	-0.962296	0.103054	0.012407	-0.016322	31.928758
11	PercentTradesWBalance	11	Special	606	0.057941	269	337	0.556106	-0.137544	0.001091	0.000136	-0.016322	32.738612
12	PercentTradesWBalance	12	Missing	0	0.000000	0	0	0.000000	0.000000	0.000000	0.000000	-0.016322	32.738612

164 rows × 13 columns

We can check the correctness of the scaling method as follows

[16]:

sc = scorecard.table(style="summary")
sc.groupby("Variable").agg({'Points' : [np.min, np.max]}).sum()

[16]:

Points  amin    300.0
        amax    850.0
dtype: float64

Scorecard performance¶

Compute predicted probabilities of the fitted estimator.

[17]:

y_pred = scorecard.predict_proba(X)[:, 1]

/home/gui/projects/github/top/optbinning/optbinning/binning/transformations.py:38: RuntimeWarning: invalid value encountered in log
  return np.log((1. / event_rate - 1) * n_event / n_nonevent)

Plot Area Under the Receiver Operating Characteristic Curve (AUC ROC).

[18]:

plot_auc_roc(y, y_pred)

../_images/tutorials_tutorial_scorecard_binary_target_32_0.png

Plot Cumulative Accuracy Profile (CAP).

[19]:

plot_cap(y, y_pred)

../_images/tutorials_tutorial_scorecard_binary_target_34_0.png

Plot Kolmogorov-Smirnov (KS).

[20]:

plot_ks(y, y_pred)

../_images/tutorials_tutorial_scorecard_binary_target_36_0.png

Calculate the score of the dataset and plot distribution of scores for event and non-event records.

[21]:

score = scorecard.score(X)

[22]:

mask = y == 0
plt.hist(score[mask], label="non-event", color="b", alpha=0.35)
plt.hist(score[~mask], label="event", color="r", alpha=0.35)
plt.xlabel("score")
plt.legend()
plt.show()

../_images/tutorials_tutorial_scorecard_binary_target_39_0.png