Tutorial: optimal binning with binary target¶
Basic¶
To get us started, let’s load a well-known dataset from the UCI repository and transform the data into a pandas.DataFrame.
[1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
[2]:
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
We choose a variable to discretize and the binary target.
[3]:
variable = "mean radius"
x = df[variable].values
y = data.target
Import and instantiate an OptimalBinning object. We pass the variable name, its data type, and a solver; in this case, we choose the constraint programming solver.
[4]:
from optbinning import OptimalBinning
[5]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp")
We fit the optimal binning object with arrays x and y.
[6]:
optb.fit(x, y)
[6]:
OptimalBinning(name='mean radius')
You can check if an optimal solution has been found via the status attribute:
[7]:
optb.status
[7]:
'OPTIMAL'
You can also retrieve the optimal split points via the splits attribute:
[8]:
optb.splits
[8]:
array([11.42500019, 12.32999992, 13.09499979, 13.70499992, 15.04500008,
16.92500019])
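These split points define half-open intervals of the form [a, b), as shown in the binning table below. As a minimal sketch (not part of the library API), the same bin assignment can be reproduced with NumPy, whose default right=False convention matches these intervals:

# Assign each value of x to a bin index 0..6 using the optimal split points.
bin_indices = np.digitize(x, optb.splits, right=False)
# Values below 11.425 fall in bin 0; values at or above 16.925 fall in bin 6.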
The binning table¶
The optimal binning algorithms return a binning table; a binning table displays the binned data and several metrics for each bin. The OptimalBinning class returns a BinningTable object via the binning_table attribute.
[9]:
binning_table = optb.binning_table
[10]:
type(binning_table)
[10]:
optbinning.binning.binning_statistics.BinningTable
The binning_table is instantiated, but not built. Therefore, the first step is to call the method build, which returns a pandas.DataFrame.
[11]:
binning_table.build()
[11]:
Bin ID | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 11.43) | 118 | 0.207381 | 3 | 115 | 0.974576 | -3.125170 | 0.962483 | 0.087205 |
1 | [11.43, 12.33) | 79 | 0.138840 | 3 | 76 | 0.962025 | -2.710972 | 0.538763 | 0.052198 |
2 | [12.33, 13.09) | 68 | 0.119508 | 7 | 61 | 0.897059 | -1.643814 | 0.226599 | 0.025513 |
3 | [13.09, 13.70) | 49 | 0.086116 | 10 | 39 | 0.795918 | -0.839827 | 0.052131 | 0.006331 |
4 | [13.70, 15.05) | 83 | 0.145870 | 28 | 55 | 0.662651 | -0.153979 | 0.003385 | 0.000423 |
5 | [15.05, 16.93) | 54 | 0.094903 | 44 | 10 | 0.185185 | 2.002754 | 0.359566 | 0.038678 |
6 | [16.93, inf) | 118 | 0.207381 | 117 | 1 | 0.008475 | 5.283323 | 2.900997 | 0.183436 |
7 | Special | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
8 | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | 569 | 1.000000 | 212 | 357 | 0.627417 | | 5.043925 | 0.393784 |
Let’s describe the columns of this binning table:
Bin: the intervals delimited by the optimal split points.
Count: the number of records for each bin.
Count (%): the percentage of records for each bin.
Non-event: the number of non-event records \((y = 0)\) for each bin.
Event: the number of event records \((y = 1)\) for each bin.
Event rate: the percentage of event records for each bin.
WoE: the Weight-of-Evidence for each bin.
IV: the Information Value (also known as Jeffrey’s divergence) for each bin.
JS: the Jensen-Shannon divergence for each bin.
The last row shows the total number of records, non-event records, event records, and IV and JS.
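These quantities can be reproduced by hand from the bin counts. The following sketch assumes the convention used above, where a bin’s WoE is the log ratio of its share of non-events to its share of events:

# Bin 0 of the table above: 3 of 212 non-events and 115 of 357 events.
p_nonevent = 3 / 212
p_event = 115 / 357
woe_0 = np.log(p_nonevent / p_event)   # approx -3.125170, as in the WoE column
iv_0 = (p_nonevent - p_event) * woe_0  # approx 0.962483, as in the IV column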
You can use the plot method to visualize the histogram and the WoE or event rate curve. Note that the Bin ID corresponds to the binning table index.
[12]:
binning_table.plot(metric="woe")
[13]:
binning_table.plot(metric="event_rate")
Note that WoE is inversely related to the event rate, i.e., a monotonically ascending event rate ensures a monotonically descending WoE and vice-versa. We will see more monotonic trend options in the advanced tutorial.
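This inverse relationship follows from the definitions: writing r for a bin’s event rate, the WoE above equals log((1 - r) / r) plus the constant log(357 / 212), which depends only on the totals, and log((1 - r) / r) is strictly decreasing in r. A quick numerical check against the table (a sketch):

# Event rates of bins 0-6 from the binning table above.
event_rate = np.array([0.974576, 0.962025, 0.897059, 0.795918,
                       0.662651, 0.185185, 0.008475])
woe = np.log((1 - event_rate) / event_rate) + np.log(357 / 212)
# woe matches the WoE column above, up to rounding.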
Event rate / WoE transformation¶
Now that we have checked the binned data, we can transform our original data into WoE or event rate values. You can check the correctness of the transformation using the pandas value_counts method, for instance.
[14]:
x_transform_woe = optb.transform(x, metric="woe")
[15]:
pd.Series(x_transform_woe).value_counts()
[15]:
-3.125170 118
5.283323 118
-0.153979 83
-2.710972 79
-1.643814 68
2.002754 54
-0.839827 49
dtype: int64
[16]:
x_transform_event_rate = optb.transform(x, metric="event_rate")
[17]:
pd.Series(x_transform_event_rate).value_counts()
[17]:
0.974576 118
0.008475 118
0.662651 83
0.962025 79
0.897059 68
0.185185 54
0.795918 49
dtype: int64
[18]:
x_transform_indices = optb.transform(x, metric="indices")
[19]:
pd.Series(x_transform_indices).value_counts()
[19]:
0 118
6 118
4 83
1 79
2 68
5 54
3 49
dtype: int64
[20]:
x_transform_bins = optb.transform(x, metric="bins")
[21]:
pd.Series(x_transform_bins).value_counts()
[21]:
(-inf, 11.43) 118
[16.93, inf) 118
[13.70, 15.05) 83
[11.43, 12.33) 79
[12.33, 13.09) 68
[15.05, 16.93) 54
[13.09, 13.70) 49
dtype: int64
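The four transform metrics are mutually consistent: looking up the WoE column of the binning table by the transformed indices should reproduce the WoE transform. A minimal sketch of such a check:

bt = optb.binning_table.build()
# Rows 0-6 of the table are the optimal bins; index them by the bin indices.
woe_lookup = bt["WoE"].values[x_transform_indices].astype(float)
assert np.allclose(woe_lookup, x_transform_woe)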
Categorical variable¶
Let’s load the application_train.csv file from Kaggle’s Home Credit Default Risk competition: https://www.kaggle.com/c/home-credit-default-risk/data.
[22]:
df_cat = pd.read_csv("data/kaggle/HomeCreditDefaultRisk/application_train.csv",
engine='c')
[23]:
variable_cat = "NAME_INCOME_TYPE"
x_cat = df_cat[variable_cat].values
y_cat = df_cat.TARGET.values
[24]:
df_cat[variable_cat].value_counts()
[24]:
Working 158774
Commercial associate 71617
Pensioner 55362
State servant 21703
Unemployed 22
Student 18
Businessman 10
Maternity leave 5
Name: NAME_INCOME_TYPE, dtype: int64
We instantiate an OptimalBinning object with the variable name, its data type (categorical), and a solver; in this case, we choose the mixed-integer programming solver. Also, for this particular example, we set cat_cutoff=0.1 to create a bin “others” comprising the categories whose percentage of occurrences is below 10%. This merges the categories State servant, Unemployed, Student, Businessman and Maternity leave.
[25]:
optb = OptimalBinning(name=variable_cat, dtype="categorical", solver="mip",
cat_cutoff=0.1)
[26]:
optb.fit(x_cat, y_cat)
[26]:
OptimalBinning(cat_cutoff=0.1, dtype='categorical', name='NAME_INCOME_TYPE',
solver='mip')
[27]:
optb.status
[27]:
'OPTIMAL'
The optimal split points are the list of classes belonging to each bin.
[28]:
optb.splits
[28]:
[array(['Pensioner'], dtype=object),
array(['Commercial associate'], dtype=object),
array(['Working'], dtype=object),
array(['State servant', 'Unemployed', 'Student', 'Businessman',
'Maternity leave'], dtype=object)]
[29]:
binning_table = optb.binning_table
binning_table.build()
[29]:
Bin ID | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|
0 | [Pensioner] | 55362 | 0.180033 | 52380 | 2982 | 0.053864 | 0.433445 | 0.028249 | 0.003504 |
1 | [Commercial associate] | 71617 | 0.232892 | 66257 | 5360 | 0.074843 | 0.082092 | 0.001516 | 0.000189 |
2 | [Working] | 158774 | 0.516320 | 143550 | 15224 | 0.095885 | -0.188675 | 0.019895 | 0.002483 |
3 | [State servant, Unemployed, Student, Businessm... | 21758 | 0.070755 | 20499 | 1259 | 0.057864 | 0.357573 | 0.007795 | 0.000969 |
4 | Special | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
5 | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | 307511 | 1.000000 | 282686 | 24825 | 0.080729 | | 0.057455 | 0.007146 |
You can use the plot method to visualize the histogram and the WoE or event rate curve. Note that for categorical variables the optimal bins are always monotonically ascending with respect to the event rate. Finally, note that bin 3 corresponds to the bin “others” and is represented with a lighter color.
[30]:
binning_table.plot(metric="event_rate")
As with the numerical dtype, we can transform our original data into WoE or event rate values. Since version 0.17.1, if cat_unknown is None (default), the transformation of categories not observed during training follows these rules:

if transform metric == 'woe' then woe(mean event rate) = 0
if transform metric == 'event_rate' then mean event rate
if transform metric == 'indices' then -1
if transform metric == 'bins' then 'unknown'
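If a fixed value is preferred for unseen categories, the cat_unknown parameter can be set at construction time instead. The following is only a sketch; treating the passed value as the fixed transform output is our assumption here, so check the API reference for the accepted values:

# Assumption: a non-None cat_unknown replaces the default rules above.
optb_unknown = OptimalBinning(name=variable_cat, dtype="categorical", solver="mip",
                              cat_cutoff=0.1, cat_unknown=0)
optb_unknown.fit(x_cat, y_cat)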
[31]:
x_new = ["Businessman", "Working", "New category"]
[32]:
x_transform_woe = optb.transform(x_new, metric="woe")
pd.DataFrame({variable_cat: x_new, "WoE": x_transform_woe})
[32]:
 | NAME_INCOME_TYPE | WoE |
---|---|---|
0 | Businessman | 0.357573 |
1 | Working | -0.188675 |
2 | New category | 0.000000 |
[33]:
x_transform_event_rate = optb.transform(x_new, metric="event_rate")
pd.DataFrame({variable_cat: x_new, "Event rate": x_transform_event_rate})
[33]:
 | NAME_INCOME_TYPE | Event rate |
---|---|---|
0 | Businessman | 0.057864 |
1 | Working | 0.095885 |
2 | New category | 0.080729 |
[34]:
x_transform_bins = optb.transform(x_new, metric="bins")
pd.DataFrame({variable_cat: x_new, "Bin": x_transform_bins})
[34]:
 | NAME_INCOME_TYPE | Bin |
---|---|---|
0 | Businessman | ['State servant' 'Unemployed' 'Student' 'Busin... |
1 | Working | ['Working'] |
2 | New category | unknown |
[35]:
x_transform_indices = optb.transform(x_new, metric="indices")
pd.DataFrame({variable_cat: x_new, "Index": x_transform_indices})
[35]:
 | NAME_INCOME_TYPE | Index |
---|---|---|
0 | Businessman | 3 |
1 | Working | 2 |
2 | New category | -1 |
Advanced¶
Optimal binning Information¶
The OptimalBinning information method prints an overview of the option settings, problem statistics, and the solution of the computation. By default, print_level=1.
[36]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="mip")
optb.fit(x, y)
[36]:
OptimalBinning(name='mean radius', solver='mip')
If print_level=0, a minimal output including the header, variable name, status, and total time is printed.
[37]:
optb.information(print_level=0)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : mean radius
Status : OPTIMAL
Time : 0.0316 sec
If print_level>=1, statistics on the pre-binning phase and the solver are printed. More detailed timing statistics are also included.
[38]:
optb.information(print_level=1)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : mean radius
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 9
Number of refinements 1
Solver statistics
Type mip
Number of variables 85
Number of constraints 45
Objective value 5.0439
Best objective bound 5.0439
Timing
Total time 0.03 sec
Pre-processing 0.00 sec ( 0.73%)
Pre-binning 0.00 sec ( 10.39%)
Solver 0.03 sec ( 87.59%)
Post-processing 0.00 sec ( 0.31%)
If print_level=2, the list of all options of the OptimalBinning is displayed. The output contains the option name, its current value, and an indicator of how it was set: options left at their default settings are marked with “d”, and options changed by the user are marked with “U”. This is inspired by the printed output of the NAG solver e04mtc; see https://www.nag.co.uk/numeric/cl/nagdoc_cl26/html/e04/e04mtc.html#fcomments.
[39]:
optb.information(print_level=2)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Begin options
name mean radius * U
dtype numerical * d
prebinning_method cart * d
solver mip * U
divergence iv * d
max_n_prebins 20 * d
min_prebin_size 0.05 * d
min_n_bins no * d
max_n_bins no * d
min_bin_size no * d
max_bin_size no * d
min_bin_n_nonevent no * d
max_bin_n_nonevent no * d
min_bin_n_event no * d
max_bin_n_event no * d
monotonic_trend auto * d
min_event_rate_diff 0 * d
max_pvalue no * d
max_pvalue_policy consecutive * d
gamma 0 * d
class_weight no * d
cat_cutoff no * d
cat_unknown no * d
user_splits no * d
user_splits_fixed no * d
special_codes no * d
split_digits no * d
mip_solver bop * d
time_limit 100 * d
verbose False * d
End options
Name : mean radius
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 9
Number of refinements 1
Solver statistics
Type mip
Number of variables 85
Number of constraints 45
Objective value 5.0439
Best objective bound 5.0439
Timing
Total time 0.03 sec
Pre-processing 0.00 sec ( 0.73%)
Pre-binning 0.00 sec ( 10.39%)
Solver 0.03 sec ( 87.59%)
Post-processing 0.00 sec ( 0.31%)
Binning table statistical analysis¶
The analysis method performs a statistical analysis of the binning table, computing the Gini index, Information Value (IV), Jensen-Shannon divergence, and the quality score. Additionally, several statistical significance tests between consecutive bins of the contingency table are performed: a frequentist test using the Chi-square test or Fisher’s exact test, and a Bayesian A/B test using the beta distribution as a conjugate prior of the Bernoulli distribution.
[40]:
binning_table.analysis(pvalue_test="chi2")
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------
General metrics
Gini index 0.12175489
IV (Jeffrey) 0.05745546
JS (Jensen-Shannon) 0.00714565
Hellinger 0.00716372
Triangular 0.02843984
KS 0.08364544
HHI 0.35824301
HHI (normalized) 0.22989161
Cramer's V 0.06007763
Quality score 0.18240827
Monotonic trend peak
Significance tests
Bin A Bin B t-statistic p-value P[A > B] P[B > A]
0 1 223.890188 1.281939e-50 4.799305e-71 1.0
1 2 268.591060 2.301360e-60 6.289833e-76 1.0
[41]:
binning_table.analysis(pvalue_test="fisher")
---------------------------------------------
OptimalBinning: Binary Binning Table Analysis
---------------------------------------------
General metrics
Gini index 0.12175489
IV (Jeffrey) 0.05745546
JS (Jensen-Shannon) 0.00714565
Hellinger 0.00716372
Triangular 0.02843984
KS 0.08364544
HHI 0.35824301
HHI (normalized) 0.22989161
Cramer's V 0.06007763
Quality score 0.18240827
Monotonic trend peak
Significance tests
Bin A Bin B odd ratio p-value P[A > B] P[B > A]
0 1 1.420990 2.091361e-51 4.799305e-71 1.0
1 2 1.310969 4.434577e-62 6.289833e-76 1.0
Event rate / WoE monotonicity¶
The monotonic_trend option permits forcing a monotonic trend on the event rate curve. The default setting “auto” should be the preferred option; however, some business constraints might require imposing a different trend. The “auto” setting chooses the monotonic trend most likely to maximize the information value among the options “ascending”, “descending”, “peak” and “valley”, using a machine-learning-based classifier.
[42]:
variable = "mean texture"
x = df[variable].values
y = data.target
[43]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp")
optb.fit(x, y)
[43]:
OptimalBinning(name='mean texture')
[44]:
binning_table = optb.binning_table
binning_table.build()
[44]:
Bin ID | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 15.05) | 92 | 0.161687 | 4 | 88 | 0.956522 | -2.569893 | 0.584986 | 0.057939 |
1 | [15.05, 16.39) | 61 | 0.107206 | 8 | 53 | 0.868852 | -1.369701 | 0.151658 | 0.017602 |
2 | [16.39, 17.03) | 29 | 0.050967 | 6 | 23 | 0.793103 | -0.822585 | 0.029715 | 0.003613 |
3 | [17.03, 18.46) | 79 | 0.138840 | 17 | 62 | 0.784810 | -0.772772 | 0.072239 | 0.008812 |
4 | [18.46, 19.47) | 55 | 0.096661 | 20 | 35 | 0.636364 | -0.038466 | 0.000142 | 0.000018 |
5 | [19.47, 20.20) | 36 | 0.063269 | 18 | 18 | 0.500000 | 0.521150 | 0.017972 | 0.002221 |
6 | [20.20, 21.71) | 72 | 0.126538 | 43 | 29 | 0.402778 | 0.915054 | 0.111268 | 0.013443 |
7 | [21.71, 22.74) | 40 | 0.070299 | 27 | 13 | 0.325000 | 1.252037 | 0.113865 | 0.013371 |
8 | [22.74, 24.00) | 29 | 0.050967 | 24 | 5 | 0.172414 | 2.089765 | 0.207309 | 0.022035 |
9 | [24.00, 26.98) | 43 | 0.075571 | 30 | 13 | 0.302326 | 1.357398 | 0.142656 | 0.016578 |
10 | [26.98, inf) | 33 | 0.057996 | 15 | 18 | 0.545455 | 0.338828 | 0.006890 | 0.000857 |
11 | Special | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
12 | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | 569 | 1.000000 | 212 | 357 | 0.627417 | | 1.438701 | 0.156488 |
[45]:
binning_table.plot(metric="event_rate")
For example, we can force the variable mean texture to be monotonically descending with respect to the probability of having breast cancer.
[46]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp",
monotonic_trend="descending")
optb.fit(x, y)
[46]:
OptimalBinning(monotonic_trend='descending', name='mean texture')
[47]:
binning_table = optb.binning_table
binning_table.build()
binning_table.plot(metric="event_rate")
Reduction of dominating bins¶
Version 0.3.0 introduced a new constraint to produce more homogeneous solutions by reducing a concentration metric, such as the difference between the largest and smallest bin. The regularization parameter gamma controls the importance of the reduction term; larger values specify stronger regularization. Continuing with the previous example:
[48]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp",
monotonic_trend="descending", gamma=0.5)
optb.fit(x, y)
[48]:
OptimalBinning(gamma=0.5, monotonic_trend='descending', name='mean texture')
[49]:
binning_table = optb.binning_table
binning_table.build()
binning_table.plot(metric="event_rate")
Note that the new solution produces more homogeneous bins, removing the dominance of bin 7 previously observed.
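One way to quantify this effect (a sketch) is the spread between the largest and smallest bin shares in the built table; it shrinks as gamma grows:

bt = binning_table.build()
# Drop the last three rows: Special, Missing and Totals.
shares = bt["Count (%)"].values[:-3].astype(float)
print(shares.max() - shares.min())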
User-defined split points¶
In some situations, we have predefined split points or bins that are required to satisfy a priori beliefs, knowledge, or business constraints. The OptimalBinning class permits passing user-defined split points for numerical variables and user-defined bins for categorical variables. The supplied splits are used as the pre-binning, overriding any pre-binning method set by the user. Furthermore, version 0.5.0 introduced the user_splits_fixed parameter, which allows the user to fix some of the user-defined splits so that they must appear in the solution.
Example numerical variable:
[50]:
user_splits = [ 14, 15, 16, 17, 20, 21, 22, 27]
user_splits_fixed = [False, True, True, False, False, False, False, False]
[51]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="mip",
user_splits=user_splits, user_splits_fixed=user_splits_fixed)
optb.fit(x, y)
[51]:
OptimalBinning(name='mean texture', solver='mip',
user_splits=[14, 15, 16, 17, 20, 21, 22, 27],
user_splits_fixed=array([False, True, True, False, False, False, False, False]))
[52]:
binning_table = optb.binning_table
binning_table.build()
[52]:
Bin ID | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 14.00) | 54 | 0.094903 | 2 | 52 | 0.962963 | -2.736947 | 0.372839 | 0.035974 |
1 | [14.00, 15.00) | 37 | 0.065026 | 2 | 35 | 0.945946 | -2.341051 | 0.207429 | 0.021268 |
2 | [15.00, 16.00) | 43 | 0.075571 | 7 | 36 | 0.837209 | -1.116459 | 0.075720 | 0.009002 |
3 | [16.00, 20.00) | 210 | 0.369069 | 59 | 151 | 0.719048 | -0.418593 | 0.060557 | 0.007515 |
4 | [20.00, 21.00) | 45 | 0.079086 | 26 | 19 | 0.422222 | 0.834807 | 0.057952 | 0.007041 |
5 | [21.00, 22.00) | 49 | 0.086116 | 30 | 19 | 0.387755 | 0.977908 | 0.086338 | 0.010382 |
6 | [22.00, 27.00) | 99 | 0.173989 | 71 | 28 | 0.282828 | 1.451625 | 0.372304 | 0.042839 |
7 | [27.00, inf) | 32 | 0.056239 | 15 | 17 | 0.531250 | 0.395986 | 0.009161 | 0.001138 |
8 | Special | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
9 | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | 569 | 1.000000 | 212 | 357 | 0.627417 | | 1.242301 | 0.135158 |
[53]:
optb.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : mean texture
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 9
Number of refinements 0
Solver statistics
Type mip
Number of variables 137
Number of constraints 55
Objective value 1.2423
Best objective bound 1.2423
Timing
Total time 0.07 sec
Pre-processing 0.00 sec ( 0.26%)
Pre-binning 0.00 sec ( 1.50%)
Solver 0.07 sec ( 97.15%)
Post-processing 0.00 sec ( 0.18%)
Example categorical variable:
[54]:
user_splits = np.array([
['Businessman'],
['Working'],
['Commercial associate'],
['Pensioner', 'Maternity leave'],
['State servant'],
['Unemployed', 'Student']], dtype=object)
[55]:
optb = OptimalBinning(name=variable_cat, dtype="categorical", solver="cp",
user_splits=user_splits,
user_splits_fixed=[False, True, True, True, True, True])
optb.fit(x_cat, y_cat)
[55]:
OptimalBinning(dtype='categorical', name='NAME_INCOME_TYPE',
user_splits=array([list(['Working']), list(['Commercial associate']),
list(['Pensioner', 'Maternity leave']), list(['State servant']),
list(['Unemployed', 'Student'])], dtype=object),
user_splits_fixed=array([ True, True, True, True, True]))
[56]:
binning_table = optb.binning_table
binning_table.build()
[56]:
Bin ID | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|
0 | [Businessman, Pensioner, Maternity leave] | 55377 | 0.180081 | 52393 | 2984 | 0.053885 | 0.433023 | 0.028206 | 0.003499 |
1 | [State servant] | 21703 | 0.070576 | 20454 | 1249 | 0.057550 | 0.363350 | 0.008010 | 0.000996 |
2 | [Commercial associate] | 71617 | 0.232892 | 66257 | 5360 | 0.074843 | 0.082092 | 0.001516 | 0.000189 |
3 | [Working] | 158774 | 0.516320 | 143550 | 15224 | 0.095885 | -0.188675 | 0.019895 | 0.002483 |
4 | [Unemployed, Student] | 40 | 0.000130 | 32 | 8 | 0.200000 | -1.046191 | 0.000219 | 0.000026 |
5 | Special | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
6 | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | 307511 | 1.000000 | 282686 | 24825 | 0.080729 | | 0.057846 | 0.007193 |
[57]:
optb.binning_table.plot(metric="event_rate")
[58]:
optb.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : NAME_INCOME_TYPE
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 5
Number of refinements 1
Solver statistics
Type cp
Number of booleans 0
Number of branches 0
Number of conflicts 0
Objective value 57843
Best objective bound 57843
Timing
Total time 0.28 sec
Pre-processing 0.04 sec ( 16.31%)
Pre-binning 0.22 sec ( 79.89%)
Solver 0.01 sec ( 3.57%)
model generation 0.01 sec ( 87.30%)
optimizer 0.00 sec ( 12.70%)
Post-processing 0.00 sec ( 0.02%)
Performance: choosing a solver¶
For small problems, say max_n_prebins <= 20, solver="mip" tends to be faster than solver="cp". However, for medium and large problems, experiments show the contrary. For very large problems, we recommend the commercial solver LocalSolver via solver="ls". See the specific LocalSolver tutorial.
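To pick a solver for a particular dataset, a simple benchmark is to time both backends on the same variable. A sketch (timings vary with machine and data):

import time

for solver in ("mip", "cp"):
    optb_solver = OptimalBinning(name=variable, dtype="numerical", solver=solver)
    start = time.perf_counter()
    optb_solver.fit(x, y)
    print(solver, optb_solver.status, round(time.perf_counter() - start, 4))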
Missing data and special codes¶
For this example, let’s load data from the FICO Explainable Machine Learning Challenge: https://community.fico.com/s/explainable-machine-learning-challenge
[59]:
df = pd.read_csv("data/FICO_challenge/heloc_dataset_v1.csv", sep=",")
The data dictionary of this challenge includes three special values/codes:
-9 No Bureau Record or No Investigation
-8 No Usable/Valid Trades or Inquiries
-7 Condition not Met (e.g. No Inquiries, No Delinquencies)
[60]:
special_codes = [-9, -8, -7]
[61]:
variable = "AverageMInFile"
x = df[variable].values
y = df.RiskPerformance.values
[62]:
df.RiskPerformance.unique()
[62]:
array(['Bad', 'Good'], dtype=object)
The target is a dichotomous categorical variable, which can easily be transformed into a numerical one.
[63]:
mask = y == "Bad"
y[mask] = 1
y[~mask] = 0
y = y.astype(int)
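Equivalently, the mapping can be written in one step (a sketch):

y = (df.RiskPerformance.values == "Bad").astype(int)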
For the sake of completeness, we introduce a few missing values:
[64]:
idx = np.random.randint(0, len(x), 500)
x = x.astype(float)
x[idx] = np.nan
[65]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="mip",
special_codes=special_codes)
optb.fit(x, y)
[65]:
OptimalBinning(name='AverageMInFile', solver='mip', special_codes=[-9, -8, -7])
[66]:
optb.information(print_level=1)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : AverageMInFile
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 13
Number of refinements 0
Solver statistics
Type mip
Number of variables 174
Number of constraints 91
Objective value 0.3235
Best objective bound 0.3235
Timing
Total time 0.10 sec
Pre-processing 0.00 sec ( 3.77%)
Pre-binning 0.01 sec ( 8.61%)
Solver 0.09 sec ( 86.93%)
Post-processing 0.00 sec ( 0.11%)
[67]:
binning_table = optb.binning_table
binning_table.build()
[67]:
Bin ID | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 30.50) | 549 | 0.052491 | 98 | 451 | 0.821494 | -1.438672 | 0.090659 | 0.010446 |
1 | [30.50, 48.50) | 1044 | 0.099818 | 286 | 758 | 0.726054 | -0.886864 | 0.072415 | 0.008766 |
2 | [48.50, 56.50) | 698 | 0.066737 | 248 | 450 | 0.644699 | -0.507991 | 0.016679 | 0.002063 |
3 | [56.50, 64.50) | 928 | 0.088727 | 389 | 539 | 0.580819 | -0.238309 | 0.004989 | 0.000622 |
4 | [64.50, 69.50) | 661 | 0.063199 | 301 | 360 | 0.544629 | -0.091166 | 0.000524 | 0.000065 |
5 | [69.50, 74.50) | 679 | 0.064920 | 327 | 352 | 0.518409 | 0.014157 | 0.000013 | 0.000002 |
6 | [74.50, 81.50) | 901 | 0.086146 | 466 | 435 | 0.482797 | 0.156667 | 0.002117 | 0.000264 |
7 | [81.50, 101.50) | 1987 | 0.189980 | 1129 | 858 | 0.431807 | 0.362311 | 0.024865 | 0.003091 |
8 | [101.50, 116.50) | 865 | 0.082704 | 540 | 325 | 0.375723 | 0.595572 | 0.028865 | 0.003556 |
9 | [116.50, inf) | 1089 | 0.104121 | 706 | 383 | 0.351699 | 0.699408 | 0.049686 | 0.006087 |
10 | Special | 565 | 0.054020 | 253 | 312 | 0.552212 | -0.121786 | 0.000798 | 0.000100 |
11 | Missing | 493 | 0.047136 | 257 | 236 | 0.478702 | 0.173072 | 0.001414 | 0.000177 |
Totals | | 10459 | 1.000000 | 5000 | 5459 | 0.521943 | | 0.293024 | 0.035239 |
Note the dashed bins 10 and 11, corresponding to the special codes bin and the missing bin, respectively.
[68]:
binning_table.plot(metric="event_rate")
Treat special codes separately¶
Version 0.13.0 introduced the option to pass a dictionary of special codes to treat them separately. This feature provides more flexibility to the modeller. Note that a special code can be a single value or a list of values, for example, a combination of several special values.
[69]:
special_codes = {'special_1': -9, "special_2": -8, "special_3": -7}
x[10:20] = -8
x[100:105] = -7
optb = OptimalBinning(name=variable, dtype="numerical", solver="mip",
special_codes=special_codes)
optb.fit(x, y)
[69]:
OptimalBinning(name='AverageMInFile', solver='mip',
special_codes={'special_1': -9, 'special_2': -8,
'special_3': -7})
[70]:
optb.binning_table.build()
[70]:
Bin ID | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 30.50) | 548 | 0.052395 | 98 | 450 | 0.821168 | -1.436452 | 0.090256 | 0.010402 |
1 | [30.50, 48.50) | 1042 | 0.099627 | 285 | 757 | 0.726488 | -0.889046 | 0.072608 | 0.008788 |
2 | [48.50, 56.50) | 698 | 0.066737 | 248 | 450 | 0.644699 | -0.507991 | 0.016679 | 0.002063 |
3 | [56.50, 64.50) | 927 | 0.088632 | 389 | 538 | 0.580367 | -0.236452 | 0.004907 | 0.000612 |
4 | [64.50, 69.50) | 658 | 0.062912 | 301 | 357 | 0.542553 | -0.082798 | 0.000430 | 0.000054 |
5 | [69.50, 74.50) | 679 | 0.064920 | 327 | 352 | 0.518409 | 0.014157 | 0.000013 | 0.000002 |
6 | [74.50, 81.50) | 899 | 0.085955 | 466 | 433 | 0.481646 | 0.161276 | 0.002239 | 0.000280 |
7 | [81.50, 101.50) | 1984 | 0.189693 | 1128 | 856 | 0.431452 | 0.363759 | 0.025025 | 0.003111 |
8 | [101.50, 116.50) | 865 | 0.082704 | 540 | 325 | 0.375723 | 0.595572 | 0.028865 | 0.003556 |
9 | [116.50, inf) | 1087 | 0.103930 | 705 | 382 | 0.351426 | 0.700605 | 0.049760 | 0.006096 |
10 | special_1 | 565 | 0.054020 | 253 | 312 | 0.552212 | -0.121786 | 0.000798 | 0.000100 |
11 | special_2 | 10 | 0.000956 | 2 | 8 | 0.800000 | -1.298467 | 0.001383 | 0.000162 |
12 | special_3 | 5 | 0.000478 | 2 | 3 | 0.600000 | -0.317637 | 0.000048 | 0.000006 |
13 | Missing | 492 | 0.047041 | 256 | 236 | 0.479675 | 0.169173 | 0.001348 | 0.000168 |
Totals | | 10459 | 1.000000 | 5000 | 5459 | 0.521943 | | 0.294358 | 0.035398 |
[71]:
optb.binning_table.plot(metric="event_rate")
[72]:
special_codes = {'special_1': -9, "special_comb": [-7, -8]}
optb = OptimalBinning(name=variable, dtype="numerical", solver="mip",
special_codes=special_codes)
optb.fit(x, y)
[72]:
OptimalBinning(name='AverageMInFile', solver='mip',
special_codes={'special_1': -9, 'special_comb': [-7, -8]})
[73]:
optb.binning_table.build()
[73]:
Bin ID | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 30.50) | 548 | 0.052395 | 98 | 450 | 0.821168 | -1.436452 | 0.090256 | 0.010402 |
1 | [30.50, 48.50) | 1042 | 0.099627 | 285 | 757 | 0.726488 | -0.889046 | 0.072608 | 0.008788 |
2 | [48.50, 56.50) | 698 | 0.066737 | 248 | 450 | 0.644699 | -0.507991 | 0.016679 | 0.002063 |
3 | [56.50, 64.50) | 927 | 0.088632 | 389 | 538 | 0.580367 | -0.236452 | 0.004907 | 0.000612 |
4 | [64.50, 69.50) | 658 | 0.062912 | 301 | 357 | 0.542553 | -0.082798 | 0.000430 | 0.000054 |
5 | [69.50, 74.50) | 679 | 0.064920 | 327 | 352 | 0.518409 | 0.014157 | 0.000013 | 0.000002 |
6 | [74.50, 81.50) | 899 | 0.085955 | 466 | 433 | 0.481646 | 0.161276 | 0.002239 | 0.000280 |
7 | [81.50, 101.50) | 1984 | 0.189693 | 1128 | 856 | 0.431452 | 0.363759 | 0.025025 | 0.003111 |
8 | [101.50, 116.50) | 865 | 0.082704 | 540 | 325 | 0.375723 | 0.595572 | 0.028865 | 0.003556 |
9 | [116.50, inf) | 1087 | 0.103930 | 705 | 382 | 0.351426 | 0.700605 | 0.049760 | 0.006096 |
10 | special_1 | 565 | 0.054020 | 253 | 312 | 0.552212 | -0.121786 | 0.000798 | 0.000100 |
11 | special_comb | 15 | 0.001434 | 4 | 11 | 0.733333 | -0.923773 | 0.001122 | 0.000136 |
12 | Missing | 492 | 0.047041 | 256 | 236 | 0.479675 | 0.169173 | 0.001348 | 0.000168 |
Totals | | 10459 | 1.000000 | 5000 | 5459 | 0.521943 | | 0.294050 | 0.035366 |
Version 0.15.1 added the option show_bin_labels to show the bin label instead of the bin ID on the x-axis.
[74]:
optb.binning_table.plot(metric="event_rate", show_bin_labels=True)
Verbosity option¶
For debugging purposes, we can print information on each step of the computation by enabling the verbose option.
[75]:
optb = OptimalBinning(name=variable, dtype="numerical", solver="mip", verbose=True)
optb.fit(x, y)
2024-01-14 23:33:34,971 | INFO : Optimal binning started.
2024-01-14 23:33:34,973 | INFO : Options: check parameters.
2024-01-14 23:33:34,976 | INFO : Pre-processing started.
2024-01-14 23:33:34,977 | INFO : Pre-processing: number of samples: 10459
2024-01-14 23:33:34,981 | INFO : Pre-processing: number of clean samples: 9967
2024-01-14 23:33:34,982 | INFO : Pre-processing: number of missing samples: 492
2024-01-14 23:33:34,984 | INFO : Pre-processing: number of special samples: 0
2024-01-14 23:33:34,985 | INFO : Pre-processing terminated. Time: 0.0021s
2024-01-14 23:33:34,987 | INFO : Pre-binning started.
2024-01-14 23:33:35,003 | INFO : Pre-binning: number of prebins: 15
2024-01-14 23:33:35,004 | INFO : Pre-binning: number of refinements: 0
2024-01-14 23:33:35,005 | INFO : Pre-binning terminated. Time: 0.0144s
2024-01-14 23:33:35,006 | INFO : Optimizer started.
2024-01-14 23:33:35,008 | INFO : Optimizer: classifier predicts descending monotonic trend.
2024-01-14 23:33:35,010 | INFO : Optimizer: monotonic trend set to descending.
2024-01-14 23:33:35,011 | INFO : Optimizer: build model...
2024-01-14 23:33:35,122 | INFO : Optimizer: solve...
2024-01-14 23:33:35,126 | INFO : Optimizer terminated. Time: 0.1194s
2024-01-14 23:33:35,128 | INFO : Post-processing started.
2024-01-14 23:33:35,130 | INFO : Post-processing: compute binning information.
2024-01-14 23:33:35,133 | INFO : Post-processing terminated. Time: 0.0002s
2024-01-14 23:33:35,135 | INFO : Optimal binning terminated. Status: OPTIMAL. Time: 0.1640s
[75]:
OptimalBinning(name='AverageMInFile', solver='mip', verbose=True)