# Tutorial: optimal binning with binary target - large scale

Continuing with the previous tutorial, version 0.4.0 introduced four new monotonic_trend options: “auto_heuristic”, “auto_asc_desc”, “peak_heuristic” and “valley_heuristic”. These heuristic options are designed to produce a substantial speedup on large instances, at the expense of not guaranteeing an optimal solution (although the optimal solution is found in the majority of cases).

[1]:

import pandas as pd

[2]:

df = pd.read_csv("data/kaggle/HomeCreditDefaultRisk/application_train.csv", engine='c')


We choose the same variable to discretize and the same binary target.

[3]:

variable = "REGION_POPULATION_RELATIVE"
x = df[variable].values
y = df.TARGET.values

[4]:

from optbinning import OptimalBinning


We use the same options to generate a granular binning, and fit the optimal binning with monotonic_trend="auto".

[5]:

optb = OptimalBinning(name=variable, dtype="numerical", solver="cp",
                      monotonic_trend="auto", max_n_prebins=100,
                      min_prebin_size=0.001, time_limit=200)

[6]:

optb.fit(x, y)

[6]:

OptimalBinning(max_n_prebins=100, min_prebin_size=0.001,
name='REGION_POPULATION_RELATIVE', time_limit=200)

[7]:

optb.status

[7]:

'OPTIMAL'

[8]:

optb.information(print_level=1)

optbinning (Version 0.19.0)

Name    : REGION_POPULATION_RELATIVE
Status  : OPTIMAL

Pre-binning statistics
Number of pre-bins                    77
Number of refinements                  0

Solver statistics
Type                                  cp
Number of booleans                  3148
Number of branches                131879
Number of conflicts                34480
Objective value                    37758
Best objective bound               37758

Timing
Total time                        145.75 sec
Pre-processing                      0.02 sec   (  0.02%)
Pre-binning                         0.47 sec   (  0.32%)
Solver                            145.25 sec   ( 99.66%)
  model generation                 25.35 sec   ( 17.45%)
  optimizer                       119.90 sec   ( 82.55%)
Post-processing                     0.00 sec   (  0.00%)


[9]:

binning_table = optb.binning_table
binning_table.build()
binning_table.analysis()

---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------

General metrics

Gini index               0.08180326
IV (Jeffrey)             0.03776231
JS (Jensen-Shannon)      0.00465074
Hellinger                0.00468508
Triangular               0.01833822
KS                       0.06087208
HHI                      0.23425608
HHI (normalized)         0.16464300
Cramer's V               0.05102627
Quality score            0.06257516

Monotonic trend                  peak

Significance tests

Bin A  Bin B  t-statistic      p-value      P[A > B]     P[B > A]
0      1     1.445262 2.292897e-01  1.013041e-01 8.986959e-01
1      2   158.939080 1.929529e-36 1.179082e-218 1.000000e+00
2      3   131.200666 2.238000e-30  1.000000e+00 1.110223e-16
3      4     0.878638 3.485750e-01  8.240457e-01 1.759543e-01
4      5    14.925402 1.118468e-04  9.999989e-01 1.123668e-06
5      6     3.969732 4.632513e-02  9.768847e-01 2.311530e-02
6      7    40.662233 1.809509e-10  1.000000e+00 3.330669e-16
7      8    29.296187 6.211781e-08  1.000000e+00 4.795264e-11
8      9    10.992632 9.147483e-04  9.997649e-01 2.351244e-04


[10]:

binning_table.plot(metric="event_rate")
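
The reported "peak" trend can be cross-checked against the significance tests. The sketch below assumes that the `P[A > B]` column is the probability that bin A has a higher event rate than bin B, and uses a hypothetical helper, `classify_trend`, that is not part of optbinning. It only reads the direction of each adjacent-bin comparison and reports where the direction flips.

```python
# P[A > B] values copied from the significance tests table above.
# Assumption: P[A > B] is the probability that bin A's event rate
# exceeds bin B's event rate.
p_a_gt_b = [1.013041e-01, 1.179082e-218, 1.000000e+00, 8.240457e-01,
            9.999989e-01, 9.768847e-01, 1.000000e+00, 1.000000e+00,
            9.997649e-01]


def classify_trend(probs, threshold=0.5):
    """Hypothetical helper: infer the trend shape from adjacent-bin comparisons.

    Returns a (shape, turning_bin) tuple; turning_bin is None unless the
    direction changes exactly once.
    """
    # Step i compares bin i with bin i + 1: "down" if bin i is more likely higher.
    steps = ["down" if p > threshold else "up" for p in probs]
    # Indices where the direction differs from the previous step.
    changes = [i for i in range(1, len(steps)) if steps[i] != steps[i - 1]]
    if not changes:
        return ("ascending" if steps[0] == "up" else "descending"), None
    if len(changes) == 1:
        shape = "peak" if steps[0] == "up" else "valley"
        return shape, changes[0]  # bin index at which the direction flips
    return "non-monotonic", None


shape, turning_bin = classify_trend(p_a_gt_b)
print(shape, turning_bin)
```

Under that assumption, the event rate rises up to bin 2 and falls afterwards, which matches the "peak" label in the analysis output.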


This is a large combinatorial problem, and solving it took roughly 146 seconds. We can try monotonic_trend="auto_heuristic" to accelerate the solution process.

[11]:

optb_auto = OptimalBinning(name=variable, dtype="numerical", solver="cp",
                           monotonic_trend="auto_heuristic", max_n_prebins=100,
                           min_prebin_size=0.001, time_limit=200)

[12]:

optb_auto.fit(x, y)

[12]:

OptimalBinning(max_n_prebins=100, min_prebin_size=0.001,
monotonic_trend='auto_heuristic',
name='REGION_POPULATION_RELATIVE', time_limit=200)

[13]:

optb_auto.status

[13]:

'OPTIMAL'

[14]:

optb_auto.information(print_level=1)

optbinning (Version 0.19.0)

Name    : REGION_POPULATION_RELATIVE
Status  : OPTIMAL

Pre-binning statistics
Number of pre-bins                    77
Number of refinements                  0

Solver statistics
Type                                  cp
Number of booleans                   424
Number of branches                   872
Number of conflicts                    8
Objective value                    37758
Best objective bound               37758

Timing
Total time                          9.79 sec
Pre-processing                      0.00 sec   (  0.04%)
Pre-binning                         0.46 sec   (  4.69%)
Solver                              9.32 sec   ( 95.22%)
  model generation                  8.63 sec   ( 92.60%)
  optimizer                         0.69 sec   (  7.40%)
Post-processing                     0.00 sec   (  0.01%)


[15]:

binning_table = optb_auto.binning_table
binning_table.build()
binning_table.analysis()

---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------

General metrics

Gini index               0.08180326
IV (Jeffrey)             0.03776231
JS (Jensen-Shannon)      0.00465074
Hellinger                0.00468508
Triangular               0.01833822
KS                       0.06087208
HHI                      0.23425608
HHI (normalized)         0.16464300
Cramer's V               0.05102627
Quality score            0.06257516

Monotonic trend                  peak

Significance tests

Bin A  Bin B  t-statistic      p-value      P[A > B]     P[B > A]
0      1     1.445262 2.292897e-01  1.013041e-01 8.986959e-01
1      2   158.939080 1.929529e-36 1.179082e-218 1.000000e+00
2      3   131.200666 2.238000e-30  1.000000e+00 1.110223e-16
3      4     0.878638 3.485750e-01  8.240457e-01 1.759543e-01
4      5    14.925402 1.118468e-04  9.999989e-01 1.123668e-06
5      6     3.969732 4.632513e-02  9.768847e-01 2.311530e-02
6      7    40.662233 1.809509e-10  1.000000e+00 3.330669e-16
7      8    29.296187 6.211781e-08  1.000000e+00 4.795264e-11
8      9    10.992632 9.147483e-04  9.997649e-01 2.351244e-04


[16]:

binning_table.plot(metric="event_rate")


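For this example, the heuristic found the same optimal solution (identical objective value, 37758) with an overall speedup of roughly 15x (145.75 sec versus 9.79 sec), and the optimizer time is reduced by more than 99% (119.90 sec versus 0.69 sec).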
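
The speedup can be recomputed from the two timing reports. The following is a minimal sketch: the numbers are copied from the `information()` outputs above, and the helper names `speedup` and `reduction_pct` are illustrative, not part of optbinning.

```python
# Timings in seconds, copied from the two information() reports above.
exact = {"total": 145.75, "solver": 145.25,
         "model generation": 25.35, "optimizer": 119.90}
heuristic = {"total": 9.79, "solver": 9.32,
             "model generation": 8.63, "optimizer": 0.69}


def speedup(before, after):
    """Ratio of the original time to the new time."""
    return before / after


def reduction_pct(before, after):
    """Percentage of the original time that was saved."""
    return 100.0 * (1.0 - after / before)


# Print one row per stage: speedup factor and percentage reduction.
for stage in exact:
    s = speedup(exact[stage], heuristic[stage])
    r = reduction_pct(exact[stage], heuristic[stage])
    print(f"{stage:<17} {s:7.1f}x  {r:5.1f}% less time")
```

With these timings, most of the gain comes from the optimizer stage, while model generation shrinks far less and ends up dominating the solver time in the heuristic run.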