Tutorial: Scorecard monitoring

This tutorial continues the two previous scorecard tutorials and focuses on scorecard monitoring. Monitoring determines whether the distribution of new data has shifted with respect to the original data used to develop the scorecard. It also helps detect errors in raw data and track scorecard performance.

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import HuberRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from optbinning import BinningProcess
from optbinning import Scorecard
from optbinning.scorecard import ScorecardMonitoring

Binary target

We use the Home Equity Line of Credit (HELOC) dataset from the FICO Explainable Machine Learning Challenge: https://community.fico.com/s/explainable-machine-learning-challenge.

df = pd.read_csv("data/FICO_challenge/heloc_dataset_v1.csv", sep=",")

variable_names = list(df.columns[1:])

target = "RiskPerformance"
y = df[target].values
mask = y == "Bad"
y[mask] = 1
y[~mask] = 0
y = y.astype(int)

X = df[variable_names]

For this example, we split the data into train and test sets to check how robust the developed scorecard is on the test set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

As in the previous example, we specify a list of special codes and the selection criteria to be applied in the binning process.

special_codes = [-9, -8, -7]
selection_criteria = {
    "iv": {"min": 0.02, "max": 1},
    "quality_score": {"min": 0.01}
}

binning_process = BinningProcess(variable_names, special_codes=special_codes,
                                 selection_criteria=selection_criteria)

estimator = LogisticRegression(solver="lbfgs")

Now, we instantiate a Scorecard class with a binning process object, an estimator and the scaling options, and fit it with the training data.

scorecard = Scorecard(binning_process=binning_process,
                      estimator=estimator, scaling_method="min_max",
                      scaling_method_params={"min": 0, "max": 100})
scorecard.fit(X_train, y_train, metric_special="empirical", metric_missing="empirical")
Scorecard(binning_process=BinningProcess(selection_criteria={'iv': {'max': 1,
                                                                    'min': 0.02},
                                                             'quality_score': {'min': 0.01}},
                                         special_codes=[-9, -8, -7],
          estimator=LogisticRegression(), scaling_method='min_max',
          scaling_method_params={'max': 100, 'min': 0})

scorecard.information(print_level=2)
  Begin options
    binning_process                      yes   * U
    estimator                            yes   * U
    scaling_method                   min_max   * U
    scaling_method_params                yes   * U
    intercept_based                    False   * d
    reverse_scorecard                  False   * d
    rounding                           False   * d
    verbose                            False   * d
  End options

    Number of records                   7321
    Number of variables                   23
    Target type                       binary

    Number of numerical                   23
    Number of categorical                  0
    Number of selected                    21

    Total time                          3.03 sec
    Binning process                     2.52 sec   ( 83.32%)
    Estimator                           0.21 sec   (  6.86%)
    Build scorecard                     0.30 sec   (  9.80%)
      rounding                          0.00 sec   (  0.00%)
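
To give a feel for what scaling_method="min_max" with {"min": 0, "max": 100} means, here is a minimal sketch of a generic linear min-max rescaling in pure Python. It is only an illustration and not OptBinning's internal implementation; the function name min_max_scale and the raw scores are hypothetical.

```python
def min_max_scale(raw_scores, new_min=0.0, new_max=100.0):
    """Linearly map raw scores onto the interval [new_min, new_max]."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        raise ValueError("raw scores must not all be equal")
    slope = (new_max - new_min) / (hi - lo)
    return [new_min + slope * (s - lo) for s in raw_scores]


# Hypothetical raw scores, for illustration only.
raw = [-2.0, -0.5, 0.0, 1.5, 3.0]
scaled = min_max_scale(raw)
print(scaled)
```

The smallest raw score maps to 0 and the largest to 100, with everything in between placed proportionally.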

Once the scorecard is fitted, we use the ScorecardMonitoring class to check that the scorecard still discriminates on both train and test data. The class also analyzes whether the distributions of train and test data differ significantly. Note the argument order of fit: the actual data comes first and the expected data second. In practice, the expected data (here X_train, y_train) would be the data used for scorecard development, whereas the actual data (here X_test, y_test) would be the evolved data.

monitoring = ScorecardMonitoring(scorecard=scorecard, psi_method="cart",
                                 psi_n_bins=10, verbose=True)
monitoring.fit(X_test, y_test, X_train, y_train)
2024-01-15 00:25:29,298 | INFO : Monitoring started.
2024-01-15 00:25:29,301 | INFO : Options: check parameters.
2024-01-15 00:25:29,304 | INFO : System stability analysis started.
2024-01-15 00:25:29,693 | INFO : System stability analysis terminated. Time: 0.3885s
2024-01-15 00:25:29,696 | INFO : Variable analysis started.
2024-01-15 00:25:29,914 | INFO : Variable analysis terminated. Time: 0.2160s
2024-01-15 00:25:29,916 | INFO : Monitoring terminated. Time: 0.6172s
                    scorecard=Scorecard(binning_process=BinningProcess(selection_criteria={'iv': {'max': 1,
                                                                                                  'min': 0.02},
                                                                                           'quality_score': {'min': 0.01}},
                                        scaling_method_params={'max': 100,
                                                               'min': 0}),

As with other objects in OptBinning, the information method prints an overview of the option settings, data statistics and CPU times.

monitoring.information(print_level=2)

  Begin options
    scorecard                            yes   * U
    psi_method                          cart   * d
    psi_n_bins                            10   * U
    psi_min_bin_size                    0.05   * d
    show_digits                            2   * d
    verbose                             True   * U
  End options

    Number of records Actual            3138
    Number of records Expected          7321
    Number of scorecard variables         21
    Target type                       binary

    Total time                          0.62 sec
    System stability                    0.39 sec   ( 62.94%)
    Variables stability                 0.22 sec   ( 34.99%)

The psi_table method returns the Population Stability Index (PSI) table. The PSI is a divergence measure equivalent to the Information Value (IV), also known as Jeffreys' divergence. It assesses whether the actual score distribution has shifted from the expected score distribution. The analysis requires segmenting the score with respect to the target, which is controlled by the options psi_method, psi_n_bins and psi_min_bin_size.

monitoring.psi_table()

        Bin             Count A  Count E  Count A (%)  Count E (%)       PSI
0       (-inf, 41.29)       236      565     0.075207     0.077175  0.000051
1       [41.29, 46.67)      334      803     0.106437     0.109684  0.000098
2       [46.67, 51.12)      331      807     0.105481     0.110231  0.000209
3       [51.12, 54.76)      310      729     0.098789     0.099577  0.000006
4       [54.76, 57.78)      416     1015     0.132569     0.138642  0.000272
5       [57.78, 60.92)      269      565     0.085723     0.077175  0.000898
6       [60.92, 63.28)      185      417     0.058955     0.056959  0.000069
7       [63.28, 67.20)      298      752     0.094965     0.102718  0.000608
8       [67.20, 72.25)      407      823     0.129700     0.112416  0.002472
9       [72.25, inf)        352      845     0.112173     0.115421  0.000093
Totals                     3138     7321     1.000000     1.000000  0.004776
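
The PSI column can be reproduced by hand from the bin counts. The sketch below assumes the usual per-bin definition, (A% - E%) * ln(A% / E%), which matches the values in the table; the helper name psi_per_bin is ours and not part of OptBinning.

```python
import math

# Bin counts copied from the PSI table (A = actual/test, E = expected/train).
count_actual = [236, 334, 331, 310, 416, 269, 185, 298, 407, 352]
count_expected = [565, 803, 807, 729, 1015, 565, 417, 752, 823, 845]


def psi_per_bin(actual, expected):
    """Return the PSI contribution of each bin from raw counts."""
    n_a, n_e = sum(actual), sum(expected)
    contributions = []
    for a, e in zip(actual, expected):
        p_a, p_e = a / n_a, e / n_e  # population share of the bin
        contributions.append((p_a - p_e) * math.log(p_a / p_e))
    return contributions


psi_bins = psi_per_bin(count_actual, count_expected)
psi_total = sum(psi_bins)
print(round(psi_total, 4))  # 0.0048
```

Each term is non-negative, so a single bin with a large shift in population share is enough to raise the total.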

We can plot the PSI table with the psi_plot method, which shows the population distribution and the event rate for each bin (Bin ID).

monitoring.psi_plot()

The tests_table method runs a statistical test per bin to determine whether the event rates on train and test data differ significantly. For a binary target it uses the Chi-square test. The null hypothesis is that actual = expected.

monitoring.tests_table()

    Bin             Count A  Count E  Event rate A  Event rate E  statistic   p-value
0   (-inf, 41.29)       236      565      0.915254      0.916814   0.005285  0.942048
1   [41.29, 46.67)      334      803      0.820359      0.851806   1.758519  0.184809
2   [46.67, 51.12)      331      807      0.758308      0.768278   0.129913  0.718522
3   [51.12, 54.76)      310      729      0.709677      0.662551   2.207654  0.137327
4   [54.76, 57.78)      416     1015      0.562500      0.565517   0.010927  0.916745
5   [57.78, 60.92)      269      565      0.464684      0.500885   0.955733  0.328264
6   [60.92, 63.28)      185      417      0.464865      0.429257   0.659372  0.416782
7   [63.28, 67.20)      298      752      0.322148      0.293883   0.808999  0.368416
8   [67.20, 72.25)      407      823      0.221130      0.196841   0.986264  0.320657
9   [72.25, inf)        352      845      0.130682      0.114793   0.596356  0.439972
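
As a sanity check, the first row of the table can be reproduced with a Pearson chi-square test on a 2x2 table of events and non-events, without continuity correction. This is a pure-Python sketch; the function name chi2_test_2x2 is hypothetical, and the event counts (216 and 518) are derived here from the reported counts and event rates rather than taken from the library.

```python
import math


def chi2_test_2x2(events_a, n_a, events_e, n_e):
    """Pearson chi-square test (no continuity correction) on a 2x2 table."""
    a, b = events_a, n_a - events_a  # actual: events, non-events
    c, d = events_e, n_e - events_e  # expected: events, non-events
    n = n_a + n_e
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Bin 0: 236 actual records (216 events), 565 expected records (518 events).
stat, p_value = chi2_test_2x2(216, 236, 518, 565)
print(round(stat, 6), round(p_value, 6))
```

A large p-value, as in this bin, means there is no evidence that the event rate differs between actual and expected data.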

Finally, the system_stability_report method summarizes the findings of the analysis. It also compares the performance of the scorecard on train and test data, to identify whether the scorecard loses predictive power on the actual (evolved) data.

monitoring.system_stability_report()

Monitoring: System Stability Report

  Population Stability Index (PSI)

    PSI total:      0.0048 (No significant change)

         PSI bin  Count  Count (%)
    [0.00, 0.10)     10        1.0
    [0.10, 0.25)      0        0.0
    [0.25, Inf+)      0        0.0

  Significance tests (H0: actual == expected)

     p-value bin  Count  Count (%)
    [0.00, 0.05)      0        0.0
    [0.05, 0.10)      0        0.0
    [0.10, 0.50)      7        0.7
    [0.50, 1.00)      3        0.3

  Target analysis

               Metric  Actual Actual (%)  Expected Expected (%)
    Number of records    3138          -      7321            -
        Event records    1638   0.521989      3821     0.521923
    Non-event records    1500   0.478011      3500     0.478077

  Performance metrics

                 Metric   Actual  Expected  Diff A - E
     True positive rate 0.755189  0.770741   -0.015551
     True negative rate 0.696000  0.681429    0.014571
    False positive rate 0.304000  0.318571   -0.014571
    False negative rate 0.244811  0.229259    0.015551
      Balanced accuracy 0.725595  0.726085   -0.000490
     Discriminant power 1.077740  1.087685   -0.009945
                   Gini 0.587042  0.604119   -0.017077
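
Two of the performance metrics above can be recomputed from the true positive and true negative rates. The sketch below assumes the standard definitions of balanced accuracy and discriminant power, which reproduce the "Actual" column of the report; the function names are ours.

```python
import math


def balanced_accuracy(tpr, tnr):
    """Mean of the true positive rate and the true negative rate."""
    return (tpr + tnr) / 2.0


def discriminant_power(tpr, tnr):
    """sqrt(3)/pi times the sum of the log-odds of TPR and TNR."""
    log_odds = math.log(tpr / (1.0 - tpr)) + math.log(tnr / (1.0 - tnr))
    return math.sqrt(3.0) / math.pi * log_odds


# Values copied from the "Actual" column of the report.
tpr_actual, tnr_actual = 0.755189, 0.696000
ba = balanced_accuracy(tpr_actual, tnr_actual)
dp = discriminant_power(tpr_actual, tnr_actual)
print(round(ba, 4), round(dp, 4))
```

Comparing these values between actual and expected data, as the "Diff A - E" column does, shows whether discrimination has degraded.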

ScorecardMonitoring also provides a characteristic stability report. The psi_variable_table method returns the PSI at the characteristic level, computed on the optimal bins incorporated in the scorecard.

monitoring.psi_variable_table(style="detailed")

     Variable               Bin             Count A  Count E  Count A (%)  Count E (%)       PSI
0    ExternalRiskEstimate   (-inf, 59.50)       494     1185     0.157425     0.161863  0.000123
1    ExternalRiskEstimate   [59.50, 63.50)      326      771     0.103888     0.105313  0.000019
2    ExternalRiskEstimate   [63.50, 65.50)      193      488     0.061504     0.066658  0.000415
3    ExternalRiskEstimate   [65.50, 67.50)      205      447     0.065328     0.061057  0.000289
4    ExternalRiskEstimate   [67.50, 70.50)      302      736     0.096240     0.100533  0.000187
...  ...                    ...                 ...      ...          ...          ...       ...
5    PercentTradesWBalance  [67.50, 73.50)      216      516     0.068834     0.070482  0.000039
6    PercentTradesWBalance  [73.50, 75.50)      178      382     0.056724     0.052179  0.000380
7    PercentTradesWBalance  [75.50, 84.50)      328      851     0.104525     0.116241  0.001245
8    PercentTradesWBalance  [84.50, 89.50)      199      434     0.063416     0.059282  0.000279
9    PercentTradesWBalance  [89.50, inf)        481     1134     0.153282     0.154897  0.000017

135 rows × 7 columns

The psi_variable_table method with style="summary" returns the total PSI of each variable.

monitoring.psi_variable_table(style="summary")

    Variable                                 PSI
0   AverageMInFile                      0.004087
1   ExternalRiskEstimate                0.003432
2   MSinceMostRecentDelq                0.001042
3   MSinceMostRecentInqexcl7days        0.001249
4   MSinceMostRecentTradeOpen           0.000180
5   MSinceOldestTradeOpen               0.002839
6   MaxDelq2PublicRecLast12M            0.000514
7   MaxDelqEver                         0.000379
8   NetFractionInstallBurden            0.003289
9   NetFractionRevolvingBurden          0.004657
10  NumBank2NatlTradesWHighUtilization  0.000867
11  NumInqLast6M                        0.001527
12  NumInqLast6Mexcl7days               0.001399
13  NumRevolvingTradesWBalance          0.001579
14  NumTotalTrades                      0.001619
15  NumTrades60Ever2DerogPubRec         0.001896
16  NumTrades90Ever2DerogPubRec         0.000495
17  NumTradesOpeninLast12M              0.001805
18  PercentInstallTrades                0.001951
19  PercentTradesNeverDelq              0.002862
20  PercentTradesWBalance               0.003316

Continuous target

Similar monitoring is available for a scorecard with a continuous target.

data = fetch_california_housing()

target = "target"
variable_names = data.feature_names
X = pd.DataFrame(data.data, columns=variable_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
binning_process = BinningProcess(variable_names)
estimator = HuberRegressor(max_iter=200)
scorecard = Scorecard(binning_process=binning_process,
                      estimator=estimator, scaling_method="min_max",
                      scaling_method_params={"min": 0, "max": 100},
                      reverse_scorecard=True)
scorecard.fit(X_train, y_train)
Scorecard(binning_process=BinningProcess(variable_names=['MedInc', 'HouseAge',
                                                         'AveOccup', 'Latitude',
          estimator=HuberRegressor(max_iter=200), reverse_scorecard=True,
          scaling_method_params={'max': 100, 'min': 0})
monitoring = ScorecardMonitoring(scorecard=scorecard, psi_method="cart",
                                 psi_n_bins=10)
monitoring.fit(X_test, y_test, X_train, y_train)
                                        scaling_method_params={'max': 100,
                                                               'min': 0}))
  Begin options
    scorecard                            yes   * U
    psi_method                          cart   * d
    psi_n_bins                            10   * U
    psi_min_bin_size                    0.05   * d
    show_digits                            2   * d
    verbose                            False   * d
  End options

    Number of records Actual            6192
    Number of records Expected         14448
    Number of scorecard variables          8
    Target type                   continuous

    Total time                          0.22 sec
    System stability                    0.13 sec   ( 59.21%)
    Variables stability                 0.09 sec   ( 40.35%)

        Bin             Count A  Count E  Count A (%)  Count E (%)       PSI
0       (-inf, 49.51)       318      725     0.051357     0.050180  0.000027
1       [49.51, 51.67)      458     1157     0.073966     0.080080  0.000486
2       [51.67, 53.68)      527     1171     0.085110     0.081049  0.000198
3       [53.68, 56.56)      861     2022     0.139050     0.139950  0.000006
4       [56.56, 59.35)      907     2093     0.146479     0.144864  0.000018
5       [59.35, 60.85)      516     1162     0.083333     0.080426  0.000103
6       [60.85, 63.37)      830     1911     0.134044     0.132267  0.000024
7       [63.37, 66.13)      665     1531     0.107397     0.105966  0.000019
8       [66.13, 70.97)      586     1377     0.094638     0.095307  0.000005
9       [70.97, inf)        524     1299     0.084625     0.089909  0.000320
Totals                     6192    14448     1.000000     1.000000  0.001206

For a continuous target, the tests_table method runs a statistical test per bin to determine whether the target means on train and test data differ significantly, using Student's t-test. The null hypothesis is that actual = expected.

    Bin             Count A  Count E    Mean A    Mean E     Std A     Std E  statistic   p-value
0   (-inf, 49.51)       318      725  0.800129  0.794417  0.335265  0.389576   0.240789  0.809789
1   [49.51, 51.67)      458     1157  1.035358  1.060722  0.481755  0.441926  -0.975857  0.329439
2   [51.67, 53.68)      527     1171  1.253723  1.235909  0.513648  0.473194   0.677242  0.498419
3   [53.68, 56.56)      861     2022  1.416871  1.403359  0.597450  0.576907   0.561459  0.574565
4   [56.56, 59.35)      907     2093  1.659644  1.686652  0.629067  0.680155  -1.053431  0.292281
5   [59.35, 60.85)      516     1162  1.953297  1.909709  0.697145  0.668803   1.196619  0.231753
6   [60.85, 63.37)      830     1911  2.311459  2.237499  0.783756  0.772066   2.280284  0.022726
7   [63.37, 66.13)      665     1531  2.635100  2.653461  0.848958  0.855882  -0.464558  0.642328
8   [66.13, 70.97)      586     1377  3.183081  3.145512  0.920088  0.910585   0.830408  0.406490
9   [70.97, inf)        524     1299  4.084080  4.148884  0.927028  0.879090  -1.370776  0.170778
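
The statistic in the first row is reproduced by the unequal-variance (Welch) form of the t-test applied to the reported means, standard deviations and counts. That is an observation from the numbers, not a statement about the library's internals. The sketch below is pure Python with hypothetical function names, and it approximates the p-value with the normal distribution, which is reasonable for roughly 700 degrees of freedom but is not the exact t-distribution value.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_e, std_e, n_e):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a, var_e = std_a ** 2 / n_a, std_e ** 2 / n_e
    t = (mean_a - mean_e) / math.sqrt(var_a + var_e)
    df = (var_a + var_e) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_e ** 2 / (n_e - 1)
    )
    return t, df


def two_sided_p_normal(t):
    """Two-sided p-value using a normal approximation (assumes large df)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


# Bin 0 of the table: actual vs expected summary statistics.
t_stat, df = welch_t(0.800129, 0.335265, 318, 0.794417, 0.389576, 725)
p_approx = two_sided_p_normal(t_stat)
print(round(t_stat, 3), round(df), round(p_approx, 3))
```

For small bins the normal approximation becomes too optimistic, and an exact t-distribution routine such as the one in SciPy would be the safer choice.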
Monitoring: System Stability Report

  Population Stability Index (PSI)

    PSI total:      0.0012 (No significant change)

         PSI bin  Count  Count (%)
    [0.00, 0.10)     10        1.0
    [0.10, 0.25)      0        0.0
    [0.25, Inf+)      0        0.0

  Significance tests (H0: actual == expected)

     p-value bin  Count  Count (%)
    [0.00, 0.05)      1        0.1
    [0.05, 0.10)      0        0.0
    [0.10, 0.50)      6        0.6
    [0.50, 1.00)      3        0.3

  Target analysis

    Metric   Actual  Expected
      Mean 2.066968  2.069240
       Std 1.145661  1.157452
       p25 1.202750  1.193000
    Median 1.810000  1.793000
       p75 2.650500  2.646000

  Performance metrics

                   Metric    Actual  Expected  Diff A - E
      Mean absolute error  0.520695  0.516443    0.004253
       Mean squared error  0.509314  0.502051    0.007263
    Median absolute error  0.392306  0.382863    0.009443
       Explained variance  0.616266  0.628663   -0.012397
                      R^2  0.611963  0.625250   -0.013287
                      MPE -0.080802 -0.082159    0.001358
                     MAPE  0.300370  0.297225    0.003145
                    SMAPE  0.136579  0.135519    0.001060
                    MdAPE  0.213391  0.210681    0.002710
                   SMdAPE  0.107431  0.105786    0.001646

     Variable   Bin                 Count A  Count E  Count A (%)  Count E (%)       PSI
0    MedInc     (-inf, 1.82)            534     1253     0.086240     0.086725  0.000003
1    MedInc     [1.82, 2.24)            533     1252     0.086079     0.086656  0.000004
2    MedInc     [2.24, 2.57)            511     1112     0.082526     0.076966  0.000388
3    MedInc     [2.57, 2.83)            409      997     0.066053     0.069006  0.000129
4    MedInc     [2.83, 3.07)            383      876     0.061854     0.060631  0.000024
...  ...        ...                     ...      ...          ...          ...       ...
2    Longitude  [-122.12, -121.45)      815     1923     0.131621     0.133098  0.000016
3    Longitude  [-121.45, -120.69)      485     1180     0.078327     0.081672  0.000140
4    Longitude  [-120.69, -119.76)      303      727     0.048934     0.050318  0.000039
5    Longitude  [-119.76, -118.91)      363      858     0.058624     0.059385  0.000010
6    Longitude  [-118.91, inf)         3297     7542     0.532461     0.522010  0.000207

72 rows × 7 columns

    Variable         PSI
0   AveBedrms   0.001911
1   AveOccup    0.003518
2   AveRooms    0.001540
3   HouseAge    0.001855
4   Latitude    0.003628
5   Longitude   0.000635
6   MedInc      0.001032
7   Population  0.000904