Tutorial: optimal binning with binary target - LocalSolver¶
To get us started, let’s load the application_train.csv file from the Kaggle’s competition https://www.kaggle.com/c/home-credit-default-risk/data.
[1]:
import pandas as pd
[2]:
df = pd.read_csv("data/kaggle/HomeCreditDefaultRisk/application_train.csv", engine='c')
We choose a variable to discretize and the binary target.
[3]:
variable = "REGION_POPULATION_RELATIVE"
x = df[variable].values
y = df.TARGET.values
Import and instantiate an OptimalBinning
object class. We pass the variable name, its data type, and a solver, in this case, we choose the commercial solver LocalSolver. Note that LocalSolver requires a time limit, which is set to 20 seconds (LocalSolver 10.0). Besides, for this example, we require a more granular binning, therefore we allow a large number of prebins with small size.
To use LocalSolver follow the avaiable instructions: https://www.localsolver.com/docs/last/quickstart/solvingyourfirstmodelinpython.html
[4]:
from optbinning import OptimalBinning
[5]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="ls", max_n_prebins=100,
min_prebin_size=0.001, time_limit=20)
We fit the optimal binning object with arrays x
and y
.
[6]:
optb.fit(x, y)
Push initial solution 100%
Model: expressions = 79028, decisions = 309, constraints = 6161, objectives = 1
Param: time limit = 20 sec, no iteration limit
[objective direction ]: maximize
[ 0 sec, 0 itr]: 0
[ optimality gap ]: 100.00%
[ 1 sec, 0 itr]: 0
[ 2 sec, 1650 itr]: 34776
[ 3 sec, 6276 itr]: 37297
[ 4 sec, 6276 itr]: 37297
[ 5 sec, 8516 itr]: 37297
[ 6 sec, 10771 itr]: 37297
[ 7 sec, 12813 itr]: 37297
[ 8 sec, 15256 itr]: 37305
[ 9 sec, 17634 itr]: 37305
[ 10 sec, 20136 itr]: 37305
[ optimality gap ]: 82.27%
[ 11 sec, 22303 itr]: 37305
[ 12 sec, 24608 itr]: 37729
[ 13 sec, 26865 itr]: 37758
[ 14 sec, 28828 itr]: 37758
[ 15 sec, 33133 itr]: 37758
[ 16 sec, 33133 itr]: 37758
[ 17 sec, 35139 itr]: 37758
[ 18 sec, 37028 itr]: 37758
[ 19 sec, 40000 itr]: 37758
[ 20 sec, 40000 itr]: 37758
[ optimality gap ]: 82.06%
[ 20 sec, 40000 itr]: 37758
[ optimality gap ]: 82.06%
40000 iterations performed in 20 seconds
Feasible solution:
obj = 37758
gap = 82.06%
bounds = 210419
[6]:
OptimalBinning(cat_cutoff=None, class_weight=None, divergence='iv',
dtype='numerical', gamma=0, max_bin_n_event=None,
max_bin_n_nonevent=None, max_bin_size=None, max_n_bins=None,
max_n_prebins=100, max_pvalue=None,
max_pvalue_policy='consecutive', min_bin_n_event=None,
min_bin_n_nonevent=None, min_bin_size=None,
min_event_rate_diff=0, min_n_bins=None, min_prebin_size=0.001,
mip_solver='bop', monotonic_trend='auto',
name='REGION_POPULATION_RELATIVE', outlier_detector=None,
outlier_params=None, prebinning_method='cart', solver='ls',
special_codes=None, split_digits=None, time_limit=20,
user_splits=None, user_splits_fixed=None, ...)
You can check if an optimal or feasible solution has been found via the status
attribute:
[7]:
optb.status
[7]:
'FEASIBLE'
[8]:
binning_table = optb.binning_table
binning_table.build()
binning_table.analysis()
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------
General metrics
Gini index 0.08180326
IV (Jeffrey) 0.03776231
JS (Jensen-Shannon) 0.00465074
Hellinger 0.00468508
Triangular 0.01833822
KS 0.06087208
HHI 0.23425608
HHI (normalized) 0.16464300
Cramer's V 0.05102627
Quality score 0.06257516
Monotonic trend peak
Significance tests
Bin A Bin B t-statistic p-value P[A > B] P[B > A]
0 1 1.445262 2.292897e-01 1.013041e-01 8.986959e-01
1 2 158.939080 1.929529e-36 1.179082e-218 1.000000e+00
2 3 131.200666 2.238000e-30 1.000000e+00 1.110223e-16
3 4 0.878638 3.485750e-01 8.240457e-01 1.759543e-01
4 5 14.925402 1.118468e-04 9.999989e-01 1.123668e-06
5 6 3.969732 4.632513e-02 9.768847e-01 2.311530e-02
6 7 40.662233 1.809509e-10 1.000000e+00 3.330669e-16
7 8 29.296187 6.211781e-08 1.000000e+00 4.795286e-11
8 9 10.992632 9.147483e-04 9.997649e-01 2.351244e-04
[9]:
binning_table.plot(metric="event_rate")
[10]:
optb.information(print_level=1)
optbinning (Version 0.18.0)
Copyright (c) 2019-2023 Guillermo Navas-Palencia, Apache License 2.0
Name : REGION_POPULATION_RELATIVE
Status : FEASIBLE
Pre-binning statistics
Number of pre-bins 77
Number of refinements 0
Solver statistics
Type ls
Number of iterations 40000
Timing
Total time 21.43 sec
Pre-processing 0.01 sec ( 0.05%)
Pre-binning 0.50 sec ( 2.35%)
Solver 20.91 sec ( 97.59%)
Post-processing 0.00 sec ( 0.00%)
Computing the optimal binning starting with a large number of prebins might be challenging in some situations, therefore solvers such as LocalSolver are suitable to find quality feasible solutions in a reasonable amount of time. However, if LocalSolver is not available we can always try solver options “cp” or “mip”.
Constraint programming solver¶
First, we keep the 5 seconds time limit:
[11]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp", max_n_prebins=100,
min_prebin_size=0.001, time_limit=5)
[12]:
optb.fit(x, y)
[12]:
OptimalBinning(cat_cutoff=None, class_weight=None, divergence='iv',
dtype='numerical', gamma=0, max_bin_n_event=None,
max_bin_n_nonevent=None, max_bin_size=None, max_n_bins=None,
max_n_prebins=100, max_pvalue=None,
max_pvalue_policy='consecutive', min_bin_n_event=None,
min_bin_n_nonevent=None, min_bin_size=None,
min_event_rate_diff=0, min_n_bins=None, min_prebin_size=0.001,
mip_solver='bop', monotonic_trend='auto',
name='REGION_POPULATION_RELATIVE', outlier_detector=None,
outlier_params=None, prebinning_method='cart', solver='cp',
special_codes=None, split_digits=None, time_limit=5,
user_splits=None, user_splits_fixed=None, ...)
The status is “UNKNOWN” therefore nor feasible or optimal solutions was found in 5 seconds.
[13]:
optb.status
[13]:
'UNKNOWN'
[14]:
optb.information(print_level=1)
optbinning (Version 0.18.0)
Copyright (c) 2019-2023 Guillermo Navas-Palencia, Apache License 2.0
Name : REGION_POPULATION_RELATIVE
Status : UNKNOWN
Pre-binning statistics
Number of pre-bins 77
Number of refinements 0
We increase the time limit to 30 seconds.
[15]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp", max_n_prebins=100,
min_prebin_size=0.001, time_limit=30)
[16]:
optb.fit(x, y)
[16]:
OptimalBinning(cat_cutoff=None, class_weight=None, divergence='iv',
dtype='numerical', gamma=0, max_bin_n_event=None,
max_bin_n_nonevent=None, max_bin_size=None, max_n_bins=None,
max_n_prebins=100, max_pvalue=None,
max_pvalue_policy='consecutive', min_bin_n_event=None,
min_bin_n_nonevent=None, min_bin_size=None,
min_event_rate_diff=0, min_n_bins=None, min_prebin_size=0.001,
mip_solver='bop', monotonic_trend='auto',
name='REGION_POPULATION_RELATIVE', outlier_detector=None,
outlier_params=None, prebinning_method='cart', solver='cp',
special_codes=None, split_digits=None, time_limit=30,
user_splits=None, user_splits_fixed=None, ...)
In 30 seconds we found a feasible solution
[17]:
optb.status
[17]:
'FEASIBLE'
[18]:
optb.information(print_level=1)
optbinning (Version 0.18.0)
Copyright (c) 2019-2023 Guillermo Navas-Palencia, Apache License 2.0
Name : REGION_POPULATION_RELATIVE
Status : FEASIBLE
Pre-binning statistics
Number of pre-bins 77
Number of refinements 0
Solver statistics
Type cp
Number of booleans 3155
Number of branches 28213
Number of conflicts 6885
Objective value 37367
Best objective bound 74233
Timing
Total time 49.50 sec
Pre-processing 0.00 sec ( 0.01%)
Pre-binning 0.45 sec ( 0.90%)
Solver 49.05 sec ( 99.09%)
model generation 19.04 sec ( 38.83%)
optimizer 30.00 sec ( 61.17%)
Post-processing 0.00 sec ( 0.00%)
[19]:
binning_table = optb.binning_table
binning_table.build()
binning_table.analysis()
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------
General metrics
Gini index 0.08891470
IV (Jeffrey) 0.03737164
JS (Jensen-Shannon) 0.00461702
Hellinger 0.00464395
Triangular 0.01825927
KS 0.06087208
HHI 0.19466583
HHI (normalized) 0.13271705
Cramer's V 0.04978528
Quality score 0.00058822
Monotonic trend peak
Significance tests
Bin A Bin B t-statistic p-value P[A > B] P[B > A]
0 1 1.221574 2.690519e-01 1.202075e-01 8.797925e-01
1 2 3.537222 6.000588e-02 3.120268e-02 9.687973e-01
2 3 0.078781 7.789566e-01 3.875777e-01 6.124223e-01
3 4 0.190265 6.626959e-01 3.352569e-01 6.647431e-01
4 5 26.771144 2.290319e-07 1.971027e-08 1.000000e+00
5 6 7.184990 7.351597e-03 9.965864e-01 3.413632e-03
6 7 40.761911 1.719521e-10 1.000000e+00 2.474909e-12
7 8 0.011697 9.138763e-01 5.399700e-01 4.600300e-01
8 9 54.211508 1.800294e-13 1.000000e+00 1.110223e-16
9 10 29.296187 6.211781e-08 1.000000e+00 4.795264e-11
10 11 10.992632 9.147483e-04 9.997649e-01 2.351244e-04
The current solution is IV = 0.03737164, compared to the LocalSolver solver solution 0.03776231. Let us increase the time limit to 200 seconds.
[20]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp", max_n_prebins=100,
min_prebin_size=0.001, time_limit=200)
[21]:
optb.fit(x, y)
[21]:
OptimalBinning(cat_cutoff=None, class_weight=None, divergence='iv',
dtype='numerical', gamma=0, max_bin_n_event=None,
max_bin_n_nonevent=None, max_bin_size=None, max_n_bins=None,
max_n_prebins=100, max_pvalue=None,
max_pvalue_policy='consecutive', min_bin_n_event=None,
min_bin_n_nonevent=None, min_bin_size=None,
min_event_rate_diff=0, min_n_bins=None, min_prebin_size=0.001,
mip_solver='bop', monotonic_trend='auto',
name='REGION_POPULATION_RELATIVE', outlier_detector=None,
outlier_params=None, prebinning_method='cart', solver='cp',
special_codes=None, split_digits=None, time_limit=200,
user_splits=None, user_splits_fixed=None, ...)
The optimal solution is found within the time limit.
[22]:
optb.status
[22]:
'OPTIMAL'
[23]:
optb.information(print_level=1)
optbinning (Version 0.18.0)
Copyright (c) 2019-2023 Guillermo Navas-Palencia, Apache License 2.0
Name : REGION_POPULATION_RELATIVE
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 77
Number of refinements 0
Solver statistics
Type cp
Number of booleans 3155
Number of branches 165953
Number of conflicts 75652
Objective value 37758
Best objective bound 37758
Timing
Total time 148.19 sec
Pre-processing 0.00 sec ( 0.00%)
Pre-binning 0.44 sec ( 0.30%)
Solver 147.75 sec ( 99.70%)
model generation 17.90 sec ( 12.12%)
optimizer 129.84 sec ( 87.88%)
Post-processing 0.00 sec ( 0.00%)
[24]:
binning_table = optb.binning_table
binning_table.build()
binning_table.analysis()
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------
General metrics
Gini index 0.08180326
IV (Jeffrey) 0.03776231
JS (Jensen-Shannon) 0.00465074
Hellinger 0.00468508
Triangular 0.01833822
KS 0.06087208
HHI 0.23425608
HHI (normalized) 0.16464300
Cramer's V 0.05102627
Quality score 0.06257516
Monotonic trend peak
Significance tests
Bin A Bin B t-statistic p-value P[A > B] P[B > A]
0 1 1.445262 2.292897e-01 1.013041e-01 8.986959e-01
1 2 158.939080 1.929529e-36 1.179082e-218 1.000000e+00
2 3 131.200666 2.238000e-30 1.000000e+00 1.110223e-16
3 4 0.878638 3.485750e-01 8.240457e-01 1.759543e-01
4 5 14.925402 1.118468e-04 9.999989e-01 1.123668e-06
5 6 3.969732 4.632513e-02 9.768847e-01 2.311530e-02
6 7 40.662233 1.809509e-10 1.000000e+00 3.330669e-16
7 8 29.296187 6.211781e-08 1.000000e+00 4.795253e-11
8 9 10.992632 9.147483e-04 9.997649e-01 2.351244e-04
The optimal solution is IV = 0.03776231, matching the LocalSolver solver solution 0.03776231.
[25]:
binning_table.plot(metric="event_rate")