Tutorial: optimal binning with binary target - large scale

Continuing from the previous tutorial, version 0.4.0 introduced four new monotonic_trend options: “auto_heuristic”, “auto_asc_desc”, “peak_heuristic” and “valley_heuristic”. These heuristic options are designed to deliver a substantial speedup on large instances, at the expense of no longer guaranteeing an optimal solution (although the optimal solution is found in the majority of cases).

Let’s start by loading the training data.

[1]:
import pandas as pd
[2]:
df = pd.read_csv("data/kaggle/HomeCreditDefaultRisk/application_train.csv", engine='c')

We choose the same variable to discretize and the same binary target as in the previous tutorial.

[3]:
variable = "REGION_POPULATION_RELATIVE"
x = df[variable].values
y = df.TARGET.values
[4]:
from optbinning import OptimalBinning

We use the same options to generate a granular binning, and fit the optimal binning with monotonic_trend="auto".

[5]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp",
                      monotonic_trend="auto", max_n_prebins=100,
                      min_prebin_size=0.001, time_limit=200)
[6]:
optb.fit(x, y)
[6]:
OptimalBinning(max_n_prebins=100, min_prebin_size=0.001,
               name='REGION_POPULATION_RELATIVE', time_limit=200)
[7]:
optb.status
[7]:
'OPTIMAL'
[8]:
optb.information(print_level=1)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0

  Name    : REGION_POPULATION_RELATIVE
  Status  : OPTIMAL

  Pre-binning statistics
    Number of pre-bins                    77
    Number of refinements                  0

  Solver statistics
    Type                                  cp
    Number of booleans                  3148
    Number of branches                131879
    Number of conflicts                34480
    Objective value                    37758
    Best objective bound               37758

  Timing
    Total time                        145.75 sec
    Pre-processing                      0.02 sec   (  0.02%)
    Pre-binning                         0.47 sec   (  0.32%)
    Solver                            145.25 sec   ( 99.66%)
      model generation                 25.35 sec   ( 17.45%)
      optimizer                       119.90 sec   ( 82.55%)
    Post-processing                     0.00 sec   (  0.00%)

[9]:
binning_table = optb.binning_table
binning_table.build()
binning_table.analysis()
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------

  General metrics

    Gini index               0.08180326
    IV (Jeffrey)             0.03776231
    JS (Jensen-Shannon)      0.00465074
    Hellinger                0.00468508
    Triangular               0.01833822
    KS                       0.06087208
    HHI                      0.23425608
    HHI (normalized)         0.16464300
    Cramer's V               0.05102627
    Quality score            0.06257516

  Monotonic trend                  peak

  Significance tests

    Bin A  Bin B  t-statistic      p-value      P[A > B]     P[B > A]
        0      1     1.445262 2.292897e-01  1.013041e-01 8.986959e-01
        1      2   158.939080 1.929529e-36 1.179082e-218 1.000000e+00
        2      3   131.200666 2.238000e-30  1.000000e+00 1.110223e-16
        3      4     0.878638 3.485750e-01  8.240457e-01 1.759543e-01
        4      5    14.925402 1.118468e-04  9.999989e-01 1.123668e-06
        5      6     3.969732 4.632513e-02  9.768847e-01 2.311530e-02
        6      7    40.662233 1.809509e-10  1.000000e+00 3.330669e-16
        7      8    29.296187 6.211781e-08  1.000000e+00 4.795264e-11
        8      9    10.992632 9.147483e-04  9.997649e-01 2.351244e-04
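
One way to read the significance table is to flag the adjacent bin pairs whose event rates are not clearly different. The sketch below does this in plain Python with the p-values copied from the output above. The 0.05 threshold and the names `p_values`, `ALPHA` and `not_significant` are assumptions made for illustration; they are not optbinning defaults or part of its API.

```python
# Adjacent-bin p-values copied from the significance table above.
p_values = {
    (0, 1): 2.292897e-01,
    (1, 2): 1.929529e-36,
    (2, 3): 2.238000e-30,
    (3, 4): 3.485750e-01,
    (4, 5): 1.118468e-04,
    (5, 6): 4.632513e-02,
    (6, 7): 1.809509e-10,
    (7, 8): 6.211781e-08,
    (8, 9): 9.147483e-04,
}

ALPHA = 0.05  # assumed significance level, chosen only for this illustration

# Keep the pairs whose p-value does not fall below the threshold.
not_significant = [pair for pair, p in p_values.items() if p >= ALPHA]
print(not_significant)  # [(0, 1), (3, 4)]
```

Under this assumed threshold, only the pairs (0, 1) and (3, 4) have event rates that are not significantly different.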

[10]:
binning_table.plot(metric="event_rate")
../_images/tutorials_tutorial_binary_large_scale_14_0.png
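
The analysis above reports a "peak" trend: the event rate rises up to some bin and falls afterwards. As a minimal sketch of what that shape means, the function below classifies a sequence of per-bin event rates. It is an illustration only, not the method optbinning uses internally. The function name `classify_trend`, the label "none" and the rates in `hypothetical_rates` are all assumptions; the rates are made-up values shaped to follow the directions in the P[A > B] column (rising over bins 0 to 2, falling from bin 2 onwards).

```python
def classify_trend(rates):
    """Classify per-bin event rates as ascending, descending, peak, valley or none."""
    # Differences between consecutive bins.
    diffs = [b - a for a, b in zip(rates, rates[1:])]
    if all(d >= 0 for d in diffs):
        return "ascending"
    if all(d <= 0 for d in diffs):
        return "descending"
    # Look for a single turning point splitting the sequence into two monotone parts.
    for k in range(1, len(diffs)):
        left, right = diffs[:k], diffs[k:]
        if all(d >= 0 for d in left) and all(d <= 0 for d in right):
            return "peak"
        if all(d <= 0 for d in left) and all(d >= 0 for d in right):
            return "valley"
    return "none"


# Hypothetical event rates (NOT taken from the binning table).
hypothetical_rates = [0.060, 0.065, 0.095, 0.085, 0.084,
                      0.078, 0.075, 0.065, 0.055, 0.045]
print(classify_trend(hypothetical_rates))  # peak
```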

This is a large combinatorial problem, and solving it took roughly 146 seconds. We can try monotonic_trend="auto_heuristic" to accelerate the solution process.

[11]:
optb_auto = OptimalBinning(name=variable, dtype="numerical", solver="cp",
                           monotonic_trend="auto_heuristic", max_n_prebins=100,
                           min_prebin_size=0.001, time_limit=200)
[12]:
optb_auto.fit(x, y)
[12]:
OptimalBinning(max_n_prebins=100, min_prebin_size=0.001,
               monotonic_trend='auto_heuristic',
               name='REGION_POPULATION_RELATIVE', time_limit=200)
[13]:
optb_auto.status
[13]:
'OPTIMAL'
[14]:
optb_auto.information(print_level=1)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0

  Name    : REGION_POPULATION_RELATIVE
  Status  : OPTIMAL

  Pre-binning statistics
    Number of pre-bins                    77
    Number of refinements                  0

  Solver statistics
    Type                                  cp
    Number of booleans                   424
    Number of branches                   872
    Number of conflicts                    8
    Objective value                    37758
    Best objective bound               37758

  Timing
    Total time                          9.79 sec
    Pre-processing                      0.00 sec   (  0.04%)
    Pre-binning                         0.46 sec   (  4.69%)
    Solver                              9.32 sec   ( 95.22%)
      model generation                  8.63 sec   ( 92.60%)
      optimizer                         0.69 sec   (  7.40%)
    Post-processing                     0.00 sec   (  0.01%)
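
To see where the savings come from, the two solver reports can be compared side by side. The sketch below copies the solver statistics from the two `information` outputs into plain dictionaries and computes how much smaller the heuristic model and search are. The names `exact`, `heuristic`, `ratios` and `same_objective` are illustrative assumptions, not optbinning attributes.

```python
# Solver statistics copied from the two information() reports above.
exact = {"booleans": 3148, "branches": 131879, "conflicts": 34480,
         "objective": 37758}
heuristic = {"booleans": 424, "branches": 872, "conflicts": 8,
             "objective": 37758}

# Reduction factor for each measure of model size and search effort.
ratios = {key: exact[key] / heuristic[key]
          for key in ("booleans", "branches", "conflicts")}

for key, ratio in ratios.items():
    print(f"{key:<10} {exact[key]:>7} -> {heuristic[key]:>4}  ({ratio:.1f}x fewer)")

# Both runs report the same objective value, so the heuristic matched the optimum here.
same_objective = exact["objective"] == heuristic["objective"]
print("same objective:", same_objective)
```

The identical objective value is what indicates that, for this instance, the heuristic run reached the same solution as the exact one.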

[15]:
binning_table = optb_auto.binning_table
binning_table.build()
binning_table.analysis()
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------

  General metrics

    Gini index               0.08180326
    IV (Jeffrey)             0.03776231
    JS (Jensen-Shannon)      0.00465074
    Hellinger                0.00468508
    Triangular               0.01833822
    KS                       0.06087208
    HHI                      0.23425608
    HHI (normalized)         0.16464300
    Cramer's V               0.05102627
    Quality score            0.06257516

  Monotonic trend                  peak

  Significance tests

    Bin A  Bin B  t-statistic      p-value      P[A > B]     P[B > A]
        0      1     1.445262 2.292897e-01  1.013041e-01 8.986959e-01
        1      2   158.939080 1.929529e-36 1.179082e-218 1.000000e+00
        2      3   131.200666 2.238000e-30  1.000000e+00 1.110223e-16
        3      4     0.878638 3.485750e-01  8.240457e-01 1.759543e-01
        4      5    14.925402 1.118468e-04  9.999989e-01 1.123668e-06
        5      6     3.969732 4.632513e-02  9.768847e-01 2.311530e-02
        6      7    40.662233 1.809509e-10  1.000000e+00 3.330669e-16
        7      8    29.296187 6.211781e-08  1.000000e+00 4.795264e-11
        8      9    10.992632 9.147483e-04  9.997649e-01 2.351244e-04

[16]:
binning_table.plot(metric="event_rate")
../_images/tutorials_tutorial_binary_large_scale_21_0.png

For this example, we found the optimal solution with an overall speedup of roughly 15x (145.75 s versus 9.79 s total time), and the optimizer time is reduced by more than 99% (119.90 s versus 0.69 s).
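
The speedup can be recomputed directly from the timing sections of the two reports. The sketch below uses only the numbers printed above; timings vary between runs and machines, so these figures describe this particular run. The variable names are illustrative assumptions.

```python
# Timings in seconds, copied from the two information() reports above.
timings = {
    "auto": {"total": 145.75, "optimizer": 119.90},
    "auto_heuristic": {"total": 9.79, "optimizer": 0.69},
}

# Overall speedup: ratio of total times.
speedup = timings["auto"]["total"] / timings["auto_heuristic"]["total"]

# Relative reduction in time spent inside the optimizer.
optimizer_reduction = 1 - (timings["auto_heuristic"]["optimizer"]
                           / timings["auto"]["optimizer"])

print(f"overall speedup:     {speedup:.1f}x")              # 14.9x
print(f"optimizer reduction: {optimizer_reduction:.1%}")   # 99.4%
```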