Tutorial: optimal binning with binary target - large scale
Continuing with the previous tutorial, version 0.4.0 introduced four new monotonic_trend options: “auto_heuristic”, “auto_asc_desc”, “peak_heuristic” and “valley_heuristic”. These heuristic options are designed to produce a substantial speedup on large instances, at the expense of not guaranteeing an optimal solution (although the optimal solution is found in the majority of cases).
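To make the trend names concrete, the following minimal sketch classifies a sequence of per-bin event rates as ascending, descending, peak or valley. It is only an illustration of what these shapes mean: the function name classify_trend and the event rates are made up for this example, and the code is not how optbinning selects or enforces a trend.

```python
def classify_trend(rates):
    """Classify the shape of a sequence of per-bin event rates.

    Hypothetical helper for illustration only; not part of optbinning.
    """
    diffs = [b - a for a, b in zip(rates, rates[1:])]
    if all(d >= 0 for d in diffs):
        return "ascending"
    if all(d <= 0 for d in diffs):
        return "descending"
    # Direction of each non-flat step: True means the rate goes up.
    signs = [d > 0 for d in diffs if d != 0]
    changes = sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    if changes == 1:
        # Up then down is a peak; down then up is a valley.
        return "peak" if signs[0] else "valley"
    return "none"


# Made-up event rates, one value per bin.
print(classify_trend([0.02, 0.05, 0.09, 0.07, 0.04]))
print(classify_trend([0.09, 0.05, 0.03, 0.06, 0.08]))
print(classify_trend([0.03, 0.04, 0.06, 0.08]))
```

A constant sequence is reported as ascending here because the ascending check runs first; that is a choice made for this sketch.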
Let’s start by loading the training data.
[1]:
import pandas as pd
[2]:
df = pd.read_csv("data/kaggle/HomeCreditDefaultRisk/application_train.csv", engine='c')
We choose the same variable to discretize and the same binary target as in the previous tutorial.
[3]:
variable = "REGION_POPULATION_RELATIVE"
x = df[variable].values
y = df.TARGET.values
[4]:
from optbinning import OptimalBinning
We use the same options to generate a granular binning, and fit the optimal binning with monotonic_trend="auto".
[5]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp",
                      monotonic_trend="auto", max_n_prebins=100,
                      min_prebin_size=0.001, time_limit=200)
[6]:
optb.fit(x, y)
[6]:
OptimalBinning(max_n_prebins=100, min_prebin_size=0.001,
name='REGION_POPULATION_RELATIVE', time_limit=200)
[7]:
optb.status
[7]:
'OPTIMAL'
[8]:
optb.information(print_level=1)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : REGION_POPULATION_RELATIVE
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 77
Number of refinements 0
Solver statistics
Type cp
Number of booleans 3148
Number of branches 131879
Number of conflicts 34480
Objective value 37758
Best objective bound 37758
Timing
Total time 145.75 sec
Pre-processing 0.02 sec ( 0.02%)
Pre-binning 0.47 sec ( 0.32%)
Solver 145.25 sec ( 99.66%)
model generation 25.35 sec ( 17.45%)
optimizer 119.90 sec ( 82.55%)
Post-processing 0.00 sec ( 0.00%)
[9]:
binning_table = optb.binning_table
binning_table.build()
binning_table.analysis()
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------
General metrics
Gini index 0.08180326
IV (Jeffrey) 0.03776231
JS (Jensen-Shannon) 0.00465074
Hellinger 0.00468508
Triangular 0.01833822
KS 0.06087208
HHI 0.23425608
HHI (normalized) 0.16464300
Cramer's V 0.05102627
Quality score 0.06257516
Monotonic trend peak
Significance tests
Bin A Bin B t-statistic p-value P[A > B] P[B > A]
0 1 1.445262 2.292897e-01 1.013041e-01 8.986959e-01
1 2 158.939080 1.929529e-36 1.179082e-218 1.000000e+00
2 3 131.200666 2.238000e-30 1.000000e+00 1.110223e-16
3 4 0.878638 3.485750e-01 8.240457e-01 1.759543e-01
4 5 14.925402 1.118468e-04 9.999989e-01 1.123668e-06
5 6 3.969732 4.632513e-02 9.768847e-01 2.311530e-02
6 7 40.662233 1.809509e-10 1.000000e+00 3.330669e-16
7 8 29.296187 6.211781e-08 1.000000e+00 4.795264e-11
8 9 10.992632 9.147483e-04 9.997649e-01 2.351244e-04
[10]:
binning_table.plot(metric="event_rate")
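As a quick sanity check on the significance tests above, the reported p-values appear consistent with a chi-square survival function with one degree of freedom applied to the statistic column (labelled t-statistic in the output). The sketch below recomputes a few of them with the standard library only. This is an observation about the printed numbers, not a description of optbinning's internals, and the helper name chi2_sf_1dof is introduced here for illustration.

```python
import math


def chi2_sf_1dof(statistic):
    """Survival function of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# (statistic, p-value) pairs copied from the significance tests table above.
rows = [
    (1.445262, 2.292897e-01),
    (3.969732, 4.632513e-02),
    (10.992632, 9.147483e-04),
]

for statistic, reported in rows:
    recomputed = chi2_sf_1dof(statistic)
    print(f"{statistic:>10.6f}  reported={reported:.6e}  recomputed={recomputed:.6e}")
```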
This is a large combinatorial problem, and it took roughly 146 seconds to solve. We can try monotonic_trend="auto_heuristic" to accelerate the solution process.
[11]:
optb_auto = OptimalBinning(name=variable, dtype="numerical", solver="cp",
                           monotonic_trend="auto_heuristic", max_n_prebins=100,
                           min_prebin_size=0.001, time_limit=200)
[12]:
optb_auto.fit(x, y)
[12]:
OptimalBinning(max_n_prebins=100, min_prebin_size=0.001,
monotonic_trend='auto_heuristic',
name='REGION_POPULATION_RELATIVE', time_limit=200)
[13]:
optb_auto.status
[13]:
'OPTIMAL'
[14]:
optb_auto.information(print_level=1)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : REGION_POPULATION_RELATIVE
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 77
Number of refinements 0
Solver statistics
Type cp
Number of booleans 424
Number of branches 872
Number of conflicts 8
Objective value 37758
Best objective bound 37758
Timing
Total time 9.79 sec
Pre-processing 0.00 sec ( 0.04%)
Pre-binning 0.46 sec ( 4.69%)
Solver 9.32 sec ( 95.22%)
model generation 8.63 sec ( 92.60%)
optimizer 0.69 sec ( 7.40%)
Post-processing 0.00 sec ( 0.01%)
[15]:
binning_table = optb_auto.binning_table
binning_table.build()
binning_table.analysis()
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------
General metrics
Gini index 0.08180326
IV (Jeffrey) 0.03776231
JS (Jensen-Shannon) 0.00465074
Hellinger 0.00468508
Triangular 0.01833822
KS 0.06087208
HHI 0.23425608
HHI (normalized) 0.16464300
Cramer's V 0.05102627
Quality score 0.06257516
Monotonic trend peak
Significance tests
Bin A Bin B t-statistic p-value P[A > B] P[B > A]
0 1 1.445262 2.292897e-01 1.013041e-01 8.986959e-01
1 2 158.939080 1.929529e-36 1.179082e-218 1.000000e+00
2 3 131.200666 2.238000e-30 1.000000e+00 1.110223e-16
3 4 0.878638 3.485750e-01 8.240457e-01 1.759543e-01
4 5 14.925402 1.118468e-04 9.999989e-01 1.123668e-06
5 6 3.969732 4.632513e-02 9.768847e-01 2.311530e-02
6 7 40.662233 1.809509e-10 1.000000e+00 3.330669e-16
7 8 29.296187 6.211781e-08 1.000000e+00 4.795264e-11
8 9 10.992632 9.147483e-04 9.997649e-01 2.351244e-04
[16]:
binning_table.plot(metric="event_rate")
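For this example, the heuristic found the same optimal solution (identical objective value and binning metrics) with an overall speedup of roughly 15x (145.75 sec versus 9.79 sec). The optimizer time alone drops by more than 99% (119.90 sec versus 0.69 sec), so model generation now accounts for most of the remaining solver time.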
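The speedup figures can be recomputed directly from the two information summaries printed above. The snippet below is a small arithmetic sketch: the numbers are copied from those summaries, and the dictionary layout and variable names are just a convenience chosen for this example.

```python
# Timings (seconds) and solver statistics copied from the two
# optb.information(print_level=1) outputs above.
runs = {
    "auto": {"total": 145.75, "model_generation": 25.35,
             "optimizer": 119.90, "booleans": 3148, "branches": 131879},
    "auto_heuristic": {"total": 9.79, "model_generation": 8.63,
                       "optimizer": 0.69, "booleans": 424, "branches": 872},
}

exact, heuristic = runs["auto"], runs["auto_heuristic"]

# Overall wall-clock speedup of the heuristic run.
overall_speedup = exact["total"] / heuristic["total"]

# Relative reduction in time spent inside the optimizer.
optimizer_reduction = 1.0 - heuristic["optimizer"] / exact["optimizer"]

# How much smaller the heuristic model is, in boolean variables.
boolean_ratio = exact["booleans"] / heuristic["booleans"]

print(f"Overall speedup:      {overall_speedup:.1f}x")
print(f"Optimizer reduction:  {optimizer_reduction:.1%}")
print(f"Boolean variables:    {boolean_ratio:.1f}x fewer")
```

On these numbers the total time ratio works out to a little under 15x, and the optimizer time falls by more than 99%.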