Tutorial: optimal binning with binary target - LocalSolver

To get started, let’s load the application_train.csv file from the Kaggle competition https://www.kaggle.com/c/home-credit-default-risk/data.

[1]:
import pandas as pd
[2]:
df = pd.read_csv("data/kaggle/HomeCreditDefaultRisk/application_train.csv", engine='c')

We choose a variable to discretize and the binary target.

[3]:
variable = "REGION_POPULATION_RELATIVE"
x = df[variable].values
y = df.TARGET.values

Import and instantiate an OptimalBinning object. We pass the variable name, its data type, and a solver; in this case, we choose the commercial solver LocalSolver. Note that LocalSolver requires a time limit, which is set here to 20 seconds (LocalSolver 10.0). In addition, this example calls for a more granular binning, so we allow a large number of prebins of small size.

To use LocalSolver, follow the available instructions: https://www.localsolver.com/docs/last/quickstart/solvingyourfirstmodelinpython.html

[4]:
from optbinning import OptimalBinning
[5]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="ls", max_n_prebins=100,
                      min_prebin_size=0.001, time_limit=20)

We fit the optimal binning object with arrays x and y.

[6]:
optb.fit(x, y)
Push initial solution 100%
Model:  expressions = 79028, decisions = 309, constraints = 6161, objectives = 1
Param:  time limit = 20 sec, no iteration limit

[objective direction ]:     maximize

[  0 sec,       0 itr]:            0
[ optimality gap     ]:      100.00%
[  1 sec,       0 itr]:            0
[  2 sec,    1650 itr]:        34776
[  3 sec,    6276 itr]:        37297
[  4 sec,    6276 itr]:        37297
[  5 sec,    8516 itr]:        37297
[  6 sec,   10771 itr]:        37297
[  7 sec,   12813 itr]:        37297
[  8 sec,   15256 itr]:        37305
[  9 sec,   17634 itr]:        37305
[ 10 sec,   20136 itr]:        37305
[ optimality gap     ]:       82.27%
[ 11 sec,   22303 itr]:        37305
[ 12 sec,   24608 itr]:        37729
[ 13 sec,   26865 itr]:        37758
[ 14 sec,   28828 itr]:        37758
[ 15 sec,   33133 itr]:        37758
[ 16 sec,   33133 itr]:        37758
[ 17 sec,   35139 itr]:        37758
[ 18 sec,   37028 itr]:        37758
[ 19 sec,   40000 itr]:        37758
[ 20 sec,   40000 itr]:        37758
[ optimality gap     ]:       82.06%
[ 20 sec,   40000 itr]:        37758
[ optimality gap     ]:       82.06%

40000 iterations performed in 20 seconds

Feasible solution: 
  obj    =        37758
  gap    =       82.06%
  bounds =       210419
[6]:
OptimalBinning(cat_cutoff=None, class_weight=None, divergence='iv',
               dtype='numerical', gamma=0, max_bin_n_event=None,
               max_bin_n_nonevent=None, max_bin_size=None, max_n_bins=None,
               max_n_prebins=100, max_pvalue=None,
               max_pvalue_policy='consecutive', min_bin_n_event=None,
               min_bin_n_nonevent=None, min_bin_size=None,
               min_event_rate_diff=0, min_n_bins=None, min_prebin_size=0.001,
               mip_solver='bop', monotonic_trend='auto',
               name='REGION_POPULATION_RELATIVE', outlier_detector=None,
               outlier_params=None, prebinning_method='cart', solver='ls',
               special_codes=None, split_digits=None, time_limit=20,
               user_splits=None, user_splits_fixed=None, ...)
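The optimality gap printed in the log can be reproduced from the objective and the bound. The following is a minimal sketch, assuming the gap is the relative distance between the bound and the objective of a maximization problem, and assuming the bound at 10 seconds was already 210419 (the log only prints the bound at the end). The helper name `relative_gap` is hypothetical.

```python
def relative_gap(objective: float, bound: float) -> float:
    """Relative distance between a maximization objective and its upper bound."""
    return (bound - objective) / bound


# Values copied from the LocalSolver log above.
gap_at_10s = relative_gap(37305, 210419)
gap_final = relative_gap(37758, 210419)

print(f"{gap_at_10s:.2%}")  # 82.27%
print(f"{gap_final:.2%}")   # 82.06%
```

A large gap here does not mean the solution is poor; it only means the solver could not prove how close to optimal it is within the time limit.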

You can check if an optimal or feasible solution has been found via the status attribute:

[7]:
optb.status
[7]:
'FEASIBLE'
[8]:
binning_table = optb.binning_table
binning_table.build()
binning_table.analysis()
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------

  General metrics

    Gini index               0.08180326
    IV (Jeffrey)             0.03776231
    JS (Jensen-Shannon)      0.00465074
    Hellinger                0.00468508
    Triangular               0.01833822
    KS                       0.06087208
    HHI                      0.23425608
    HHI (normalized)         0.16464300
    Cramer's V               0.05102627
    Quality score            0.06257516

  Monotonic trend                  peak

  Significance tests

    Bin A  Bin B  t-statistic      p-value      P[A > B]     P[B > A]
        0      1     1.445262 2.292897e-01  1.013041e-01 8.986959e-01
        1      2   158.939080 1.929529e-36 1.179082e-218 1.000000e+00
        2      3   131.200666 2.238000e-30  1.000000e+00 1.110223e-16
        3      4     0.878638 3.485750e-01  8.240457e-01 1.759543e-01
        4      5    14.925402 1.118468e-04  9.999989e-01 1.123668e-06
        5      6     3.969732 4.632513e-02  9.768847e-01 2.311530e-02
        6      7    40.662233 1.809509e-10  1.000000e+00 3.330669e-16
        7      8    29.296187 6.211781e-08  1.000000e+00 4.795286e-11
        8      9    10.992632 9.147483e-04  9.997649e-01 2.351244e-04
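The two HHI figures above are related by the usual normalization of the Herfindahl-Hirschman index. The sketch below assumes the standard formula and assumes the binning table has 12 rows: the 10 bins that appear in the significance tests plus a Special and a Missing row. The function name `normalized_hhi` is hypothetical.

```python
def normalized_hhi(hhi: float, n_bins: int) -> float:
    """Rescale HHI from [1/n, 1] to [0, 1] for a table with n_bins rows."""
    return (hhi - 1.0 / n_bins) / (1.0 - 1.0 / n_bins)


# HHI copied from the analysis above; 12 rows is an assumption (10 bins + Special + Missing).
value = normalized_hhi(0.23425608, 12)
print(f"{value:.6f}")  # 0.164643
```

Under that assumption the result agrees with the reported HHI (normalized) of 0.16464300.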

[9]:
binning_table.plot(metric="event_rate")
[Figure: binning plot with the event rate per bin]
[10]:
optb.information(print_level=1)
optbinning (Version 0.18.0)
Copyright (c) 2019-2023 Guillermo Navas-Palencia, Apache License 2.0

  Name    : REGION_POPULATION_RELATIVE
  Status  : FEASIBLE

  Pre-binning statistics
    Number of pre-bins                    77
    Number of refinements                  0

  Solver statistics
    Type                                  ls
    Number of iterations               40000

  Timing
    Total time                         21.43 sec
    Pre-processing                      0.01 sec   (  0.05%)
    Pre-binning                         0.50 sec   (  2.35%)
    Solver                             20.91 sec   ( 97.59%)
    Post-processing                     0.00 sec   (  0.00%)

Computing the optimal binning from a large number of prebins can be challenging, so solvers such as LocalSolver are well suited to finding good feasible solutions in a reasonable amount of time. However, if LocalSolver is not available, we can always try the solver options “cp” or “mip”.
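One way to pick the solver option at run time is to check whether the LocalSolver Python bindings can be imported. This is a minimal sketch: the helper `pick_solver` is hypothetical, and the module name `localsolver` is an assumption about how the bindings are installed in your environment.

```python
import importlib.util


def pick_solver(preferred: str = "ls", fallback: str = "cp",
                module: str = "localsolver") -> str:
    """Return `preferred`, or `fallback` if "ls" is requested but `module` is missing."""
    if preferred == "ls" and importlib.util.find_spec(module) is None:
        return fallback
    return preferred


# The result depends on the local installation: "ls" or "cp".
solver = pick_solver()
print(solver)
```

The returned string can then be passed as the `solver` argument of `OptimalBinning`.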

Constraint programming solver

First, we set a 5-second time limit:

[11]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp", max_n_prebins=100,
                      min_prebin_size=0.001, time_limit=5)
[12]:
optb.fit(x, y)
[12]:
OptimalBinning(cat_cutoff=None, class_weight=None, divergence='iv',
               dtype='numerical', gamma=0, max_bin_n_event=None,
               max_bin_n_nonevent=None, max_bin_size=None, max_n_bins=None,
               max_n_prebins=100, max_pvalue=None,
               max_pvalue_policy='consecutive', min_bin_n_event=None,
               min_bin_n_nonevent=None, min_bin_size=None,
               min_event_rate_diff=0, min_n_bins=None, min_prebin_size=0.001,
               mip_solver='bop', monotonic_trend='auto',
               name='REGION_POPULATION_RELATIVE', outlier_detector=None,
               outlier_params=None, prebinning_method='cart', solver='cp',
               special_codes=None, split_digits=None, time_limit=5,
               user_splits=None, user_splits_fixed=None, ...)

The status is “UNKNOWN”, so neither a feasible nor an optimal solution was found in 5 seconds.

[13]:
optb.status
[13]:
'UNKNOWN'
[14]:
optb.information(print_level=1)
optbinning (Version 0.18.0)
Copyright (c) 2019-2023 Guillermo Navas-Palencia, Apache License 2.0

  Name    : REGION_POPULATION_RELATIVE
  Status  : UNKNOWN

  Pre-binning statistics
    Number of pre-bins                    77
    Number of refinements                  0
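Retrying with a longer time limit, as this tutorial does by hand, can also be scripted. The sketch below is one possible pattern, not part of optbinning: `fit_with_escalating_limit` and `fit_once` are hypothetical names, and `fit_once` is assumed to fit a binning with the given time limit and return its status string.

```python
from typing import Callable, Iterable, Tuple


def fit_with_escalating_limit(
    fit_once: Callable[[int], str],
    time_limits: Iterable[int],
    accept: Tuple[str, ...] = ("OPTIMAL",),
) -> Tuple[int, str]:
    """Try each time limit in order and stop at the first acceptable status."""
    last = None
    for limit in time_limits:
        status = fit_once(limit)
        last = (limit, status)
        if status in accept:
            break
    if last is None:
        raise ValueError("time_limits must not be empty")
    return last


# Hypothetical wiring to the objects used in this tutorial (not executed here):
# def fit_once(limit):
#     optb = OptimalBinning(name=variable, dtype="numerical", solver="cp",
#                           max_n_prebins=100, min_prebin_size=0.001,
#                           time_limit=limit)
#     return optb.fit(x, y).status
#
# limit, status = fit_with_escalating_limit(fit_once, [5, 30, 200])
```

Passing `accept=("FEASIBLE", "OPTIMAL")` stops as soon as any solution is available, which trades solution quality for run time.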

We increase the time limit to 30 seconds.

[15]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp", max_n_prebins=100,
                      min_prebin_size=0.001, time_limit=30)
[16]:
optb.fit(x, y)
[16]:
OptimalBinning(cat_cutoff=None, class_weight=None, divergence='iv',
               dtype='numerical', gamma=0, max_bin_n_event=None,
               max_bin_n_nonevent=None, max_bin_size=None, max_n_bins=None,
               max_n_prebins=100, max_pvalue=None,
               max_pvalue_policy='consecutive', min_bin_n_event=None,
               min_bin_n_nonevent=None, min_bin_size=None,
               min_event_rate_diff=0, min_n_bins=None, min_prebin_size=0.001,
               mip_solver='bop', monotonic_trend='auto',
               name='REGION_POPULATION_RELATIVE', outlier_detector=None,
               outlier_params=None, prebinning_method='cart', solver='cp',
               special_codes=None, split_digits=None, time_limit=30,
               user_splits=None, user_splits_fixed=None, ...)

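With a 30-second limit the solver finds a feasible solution. The limit applies to the optimizer only: as the timing report below shows, model generation adds about 19 seconds on top.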

[17]:
optb.status
[17]:
'FEASIBLE'
[18]:
optb.information(print_level=1)
optbinning (Version 0.18.0)
Copyright (c) 2019-2023 Guillermo Navas-Palencia, Apache License 2.0

  Name    : REGION_POPULATION_RELATIVE
  Status  : FEASIBLE

  Pre-binning statistics
    Number of pre-bins                    77
    Number of refinements                  0

  Solver statistics
    Type                                  cp
    Number of booleans                  3155
    Number of branches                 28213
    Number of conflicts                 6885
    Objective value                    37367
    Best objective bound               74233

  Timing
    Total time                         49.50 sec
    Pre-processing                      0.00 sec   (  0.01%)
    Pre-binning                         0.45 sec   (  0.90%)
    Solver                             49.05 sec   ( 99.09%)
      model generation                 19.04 sec   ( 38.83%)
      optimizer                        30.00 sec   ( 61.17%)
    Post-processing                     0.00 sec   (  0.00%)

[19]:
binning_table = optb.binning_table
binning_table.build()
binning_table.analysis()
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------

  General metrics

    Gini index               0.08891470
    IV (Jeffrey)             0.03737164
    JS (Jensen-Shannon)      0.00461702
    Hellinger                0.00464395
    Triangular               0.01825927
    KS                       0.06087208
    HHI                      0.19466583
    HHI (normalized)         0.13271705
    Cramer's V               0.04978528
    Quality score            0.00058822

  Monotonic trend                  peak

  Significance tests

    Bin A  Bin B  t-statistic      p-value     P[A > B]     P[B > A]
        0      1     1.221574 2.690519e-01 1.202075e-01 8.797925e-01
        1      2     3.537222 6.000588e-02 3.120268e-02 9.687973e-01
        2      3     0.078781 7.789566e-01 3.875777e-01 6.124223e-01
        3      4     0.190265 6.626959e-01 3.352569e-01 6.647431e-01
        4      5    26.771144 2.290319e-07 1.971027e-08 1.000000e+00
        5      6     7.184990 7.351597e-03 9.965864e-01 3.413632e-03
        6      7    40.761911 1.719521e-10 1.000000e+00 2.474909e-12
        7      8     0.011697 9.138763e-01 5.399700e-01 4.600300e-01
        8      9    54.211508 1.800294e-13 1.000000e+00 1.110223e-16
        9     10    29.296187 6.211781e-08 1.000000e+00 4.795264e-11
       10     11    10.992632 9.147483e-04 9.997649e-01 2.351244e-04
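The IV (Jeffrey) metric used to compare solutions can be computed from per-bin event and non-event counts. The sketch below uses the standard weight-of-evidence definition with the non-event distribution in the numerator, which is an assumption about the sign convention. The counts are made up for illustration and are not taken from the dataset.

```python
import math


def information_value(n_nonevent, n_event):
    """Sum of (p_nonevent - p_event) * WoE over bins; assumes no zero counts."""
    total_nonevent = sum(n_nonevent)
    total_event = sum(n_event)
    iv = 0.0
    for ne, e in zip(n_nonevent, n_event):
        p_ne = ne / total_nonevent
        p_e = e / total_event
        woe = math.log(p_ne / p_e)  # weight of evidence of the bin
        iv += (p_ne - p_e) * woe
    return iv


# Illustrative counts for two bins (hypothetical data).
iv_example = information_value([80, 20], [10, 10])
print(f"{iv_example:.6f}")
```

Each term of the sum is non-negative, so merging bins can only keep IV equal or lower it, which is why a finer feasible binning can score higher than a coarser one.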

The current solution has IV = 0.03737164, compared with 0.03776231 for the LocalSolver solution. Let us increase the time limit to 200 seconds.

[20]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp", max_n_prebins=100,
                      min_prebin_size=0.001, time_limit=200)
[21]:
optb.fit(x, y)
[21]:
OptimalBinning(cat_cutoff=None, class_weight=None, divergence='iv',
               dtype='numerical', gamma=0, max_bin_n_event=None,
               max_bin_n_nonevent=None, max_bin_size=None, max_n_bins=None,
               max_n_prebins=100, max_pvalue=None,
               max_pvalue_policy='consecutive', min_bin_n_event=None,
               min_bin_n_nonevent=None, min_bin_size=None,
               min_event_rate_diff=0, min_n_bins=None, min_prebin_size=0.001,
               mip_solver='bop', monotonic_trend='auto',
               name='REGION_POPULATION_RELATIVE', outlier_detector=None,
               outlier_params=None, prebinning_method='cart', solver='cp',
               special_codes=None, split_digits=None, time_limit=200,
               user_splits=None, user_splits_fixed=None, ...)

The optimal solution is found within the time limit.

[22]:
optb.status
[22]:
'OPTIMAL'
[23]:
optb.information(print_level=1)
optbinning (Version 0.18.0)
Copyright (c) 2019-2023 Guillermo Navas-Palencia, Apache License 2.0

  Name    : REGION_POPULATION_RELATIVE
  Status  : OPTIMAL

  Pre-binning statistics
    Number of pre-bins                    77
    Number of refinements                  0

  Solver statistics
    Type                                  cp
    Number of booleans                  3155
    Number of branches                165953
    Number of conflicts                75652
    Objective value                    37758
    Best objective bound               37758

  Timing
    Total time                        148.19 sec
    Pre-processing                      0.00 sec   (  0.00%)
    Pre-binning                         0.44 sec   (  0.30%)
    Solver                            147.75 sec   ( 99.70%)
      model generation                 17.90 sec   ( 12.12%)
      optimizer                       129.84 sec   ( 87.88%)
    Post-processing                     0.00 sec   (  0.00%)

[24]:
binning_table = optb.binning_table
binning_table.build()
binning_table.analysis()
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------

  General metrics

    Gini index               0.08180326
    IV (Jeffrey)             0.03776231
    JS (Jensen-Shannon)      0.00465074
    Hellinger                0.00468508
    Triangular               0.01833822
    KS                       0.06087208
    HHI                      0.23425608
    HHI (normalized)         0.16464300
    Cramer's V               0.05102627
    Quality score            0.06257516

  Monotonic trend                  peak

  Significance tests

    Bin A  Bin B  t-statistic      p-value      P[A > B]     P[B > A]
        0      1     1.445262 2.292897e-01  1.013041e-01 8.986959e-01
        1      2   158.939080 1.929529e-36 1.179082e-218 1.000000e+00
        2      3   131.200666 2.238000e-30  1.000000e+00 1.110223e-16
        3      4     0.878638 3.485750e-01  8.240457e-01 1.759543e-01
        4      5    14.925402 1.118468e-04  9.999989e-01 1.123668e-06
        5      6     3.969732 4.632513e-02  9.768847e-01 2.311530e-02
        6      7    40.662233 1.809509e-10  1.000000e+00 3.330669e-16
        7      8    29.296187 6.211781e-08  1.000000e+00 4.795253e-11
        8      9    10.992632 9.147483e-04  9.997649e-01 2.351244e-04
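The p-values in the significance tests are numerically consistent with a chi-square test with one degree of freedom applied to the statistic in the "t-statistic" column. This is an observation from the printed numbers, not a statement about optbinning's internals, and the helper name `chi2_pvalue_1dof` is hypothetical.

```python
import math


def chi2_pvalue_1dof(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Statistics copied from the first and sixth rows of the table above.
p_first = chi2_pvalue_1dof(1.445262)
p_sixth = chi2_pvalue_1dof(3.969732)
print(f"{p_first:.4f}")  # table reports 2.292897e-01
print(f"{p_sixth:.4f}")  # table reports 4.632513e-02
```

Rows with a p-value above a chosen threshold, such as bins 0 and 1 or bins 3 and 4, have event rates that are not clearly distinguishable.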

The optimal solution has IV = 0.03776231, matching the LocalSolver solution.

[25]:
binning_table.plot(metric="event_rate")
[Figure: binning plot with the event rate per bin]