Tutorial: optimal binning with multiclass target

Basic

To get us started, let’s load a well-known dataset from the UCI repository and transform the data into a pandas.DataFrame.

[1]:
import pandas as pd
from sklearn.datasets import load_wine
[2]:
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)

We choose a variable to discretize and the multiclass target.

[3]:
variable = "ash"
x = df[variable].values
y = data.target

Import and instantiate a MulticlassOptimalBinning object. We pass the variable name and a solver; in this case, we choose the constraint programming solver ("cp").

[4]:
from optbinning import MulticlassOptimalBinning
[5]:
optb = MulticlassOptimalBinning(name=variable, solver="cp")

We fit the optimal binning object with arrays x and y.

[6]:
optb.fit(x, y)
[6]:
MulticlassOptimalBinning(name='ash')

You can check if an optimal solution has been found via the status attribute:

[7]:
optb.status
[7]:
'OPTIMAL'

You can also retrieve the optimal split points via the splits attribute:

[8]:
optb.splits
[8]:
array([2.1450001 , 2.245     , 2.31499994, 2.6049999 , 2.6450001 ])

The binning table

The optimal binning algorithms return a binning table, which displays the binned data and several metrics for each bin. Class MulticlassOptimalBinning returns a MulticlassBinningTable object via the binning_table attribute.

[9]:
binning_table = optb.binning_table
[10]:
type(binning_table)
[10]:
optbinning.binning.binning_statistics.MulticlassBinningTable

The binning_table is instantiated, but not built. Therefore, the first step is to call the method build, which returns a pandas.DataFrame.

[11]:
binning_table.build()
[11]:
                 Bin  Count  Count (%)  Event_0  Event_1  Event_2  Event_rate_0  Event_rate_1  Event_rate_2
0       (-inf, 2.15)     31   0.174157        7       23        1      0.225806      0.741935      0.032258
1       [2.15, 2.25)     20   0.112360        2       13        5      0.100000      0.650000      0.250000
2       [2.25, 2.31)     26   0.146067        9       10        7      0.346154      0.384615      0.269231
3       [2.31, 2.60)     64   0.359551       24       17       23      0.375000      0.265625      0.359375
4       [2.60, 2.65)     10   0.056180        4        1        5      0.400000      0.100000      0.500000
5        [2.65, inf)     27   0.151685       13        7        7      0.481481      0.259259      0.259259
6            Special      0   0.000000        0        0        0      0.000000      0.000000      0.000000
7            Missing      0   0.000000        0        0        0      0.000000      0.000000      0.000000
Totals                  178   1.000000       59       71       48      0.331461      0.398876      0.269663

Let’s describe the columns of this binning table:

  • Bin: the intervals delimited by the optimal split points.

  • Count: the number of records for each bin.

  • Count (%): the proportion of records in each bin, expressed as a fraction of the total.

  • Event: the number of event records \((y = class)\) for each bin, with one column Event_k per class k.

  • Event rate: the proportion of event records in each bin, with one column Event_rate_k per class k. This is computed using a one-vs-rest (one-vs-all) approach.

The last row shows the total number of records, event records, and event rates.
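
As a small illustration of how these columns relate, the following sketch recomputes Count, Count (%) and the one-vs-rest event rates of the first bin from its class counts. The helper bin_metrics is hypothetical and not part of optbinning; the counts are copied from the table above.

```python
def bin_metrics(class_counts, n_total):
    """Return (count, share of records, per-class event rates) for one bin.

    Hypothetical helper for illustration only; not part of optbinning.
    """
    count = sum(class_counts)
    share = count / n_total
    # One-vs-rest: each class is treated as the "event" in turn and the
    # remaining classes as non-events, so the rate is simply class / count.
    rates = [c / count for c in class_counts]
    return count, share, rates


# Class counts (Event_0, Event_1, Event_2) of bin 0 and the total of 178 records.
count, share, rates = bin_metrics([7, 23, 1], n_total=178)
print(count, round(share, 6), [round(r, 6) for r in rates])
```

This prints 31 0.174157 [0.225806, 0.741935, 0.032258], which agrees with the first row of the binning table.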

You can use the method plot to visualize the histogram and event rate curve. Note that the Bin ID corresponds to the binning table index.

[12]:
binning_table.plot()
../_images/tutorials_tutorial_multiclass_24_0.png

Alternatively, the bin intervals can be shown as axis labels instead of the Bin ID by using show_bin_labels=True (available since version 0.15.1).

[13]:
binning_table.plot(show_bin_labels=True)
../_images/tutorials_tutorial_multiclass_26_0.png

Transformation

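Now that we have checked the binned data, we can transform our original data. The transform method supports measures based on the one-vs-rest WoE metric; in this example we use metric="bins", which assigns each value to its bin interval. You can check the correctness of the transformation using the pandas value_counts method, for instance: the counts should match the Count column of the binning table.

[14]:
x_transform_bins = optb.transform(x, metric="bins")
[15]:
pd.Series(x_transform_bins).value_counts()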
[15]:
[2.31, 2.60)    64
(-inf, 2.15)    31
[2.65, inf)     27
[2.25, 2.31)    26
[2.15, 2.25)    20
[2.60, 2.65)    10
dtype: int64
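
To make the mapping from values to bins concrete, here is a minimal sketch that assigns values to the half-open intervals defined by the split points, using only the standard library. It is not the library's implementation of transform; the helper bin_index is hypothetical and the split points are rounded from optb.splits above.

```python
from bisect import bisect_right

# Split points rounded from optb.splits (assumed values for illustration).
splits = [2.145, 2.245, 2.315, 2.605, 2.645]


def bin_index(value, splits):
    """Index of the half-open bin [left, right) that contains value."""
    # bisect_right sends a value equal to a split point to the bin on its
    # right, which matches the closed-left intervals of the binning table.
    return bisect_right(splits, value)


values = [2.0, 2.2, 2.5, 2.645, 3.1]
indices = [bin_index(v, splits) for v in values]
print(indices)
```

This prints [0, 1, 3, 5, 5]: for example, 2.5 falls in bin 3, the interval [2.31, 2.60), and a value equal to the last split point belongs to the last bin.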

Advanced

Many of the advanced options have been covered in the previous tutorials with a binary target. Check them out! In this section, we focus on the binning table statistical analysis and the event rate monotonicity trends.

Binning table statistical analysis

The analysis method performs a statistical analysis of the binning table, computing the Jensen-Shannon divergence and the quality score. Additionally, a statistical significance test between consecutive bins of the contingency table is performed using the Chi-square test.

[16]:
binning_table.analysis()
-------------------------------------------------
OptimalBinning: Multiclass Binning Table Analysis
-------------------------------------------------

  General metrics

    JS (Jensen-Shannon)      0.10989515
    HHI                      0.21973236
    HHI (normalized)         0.10826555
    Cramer's V               0.31694075
    Quality score            0.05279822

  Monotonic trend

    Class  0                     valley
    Class  1                     valley
    Class  2                       peak

  Significance tests

    Bin A  Bin B  t-statistic  p-value
        0      1     6.135081 0.046535
        1      2     4.472669 0.106849
        2      3     1.365275 0.505282
        3      4     1.441642 0.486353
        4      5     2.265477 0.322150
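
The first row of the significance tests can be reproduced by hand. The sketch below computes the Pearson Chi-square statistic for the class counts of bins 0 and 1; it is an illustration under stated assumptions, not the library's code. The helper chi_square_two_bins is hypothetical, and the closed-form p-value is valid only for 2 degrees of freedom, which is the case here (2 bins, 3 classes).

```python
from math import exp


def chi_square_two_bins(bin_a, bin_b):
    """Pearson chi-square statistic for a 2 x k table of class counts.

    Assumes every class occurs in at least one of the two bins, so that no
    expected count is zero.
    """
    rows = [bin_a, bin_b]
    row_totals = [sum(r) for r in rows]
    col_totals = [a + b for a, b in zip(bin_a, bin_b)]
    n = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Class counts (Event_0, Event_1, Event_2) of bins 0 and 1 in the first table.
stat = chi_square_two_bins([7, 23, 1], [2, 13, 5])
# (2 - 1) * (3 - 1) = 2 degrees of freedom, for which the chi-square
# survival function reduces to exp(-x / 2).
p_value = exp(-stat / 2)
print(round(stat, 4), round(p_value, 4))
```

This prints 6.1351 0.0465, in agreement with the values reported for bins 0 and 1 above.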

Event rate monotonicity

The monotonic_trend option allows forcing a monotonic trend on the event rate curve of each class. The default setting “auto” should be the preferred option; however, some business constraints might require imposing different trends. With “auto”, a machine-learning-based classifier chooses, for each class, the monotonic trend most likely to maximize the information value among the options “ascending”, “descending”, “peak” and “valley”.
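
To clarify what the four trend names mean, the following sketch labels a sequence of event rates by the sign pattern of its consecutive differences. The function classify_trend is a hypothetical helper, not how optbinning detects or predicts trends; the rates are the Event_rate columns of the first binning table.

```python
def classify_trend(rates):
    """Label a sequence of event rates by the shape of its curve."""
    diffs = [b - a for a, b in zip(rates, rates[1:])]
    if all(d >= 0 for d in diffs):
        return "ascending"
    if all(d <= 0 for d in diffs):
        return "descending"
    # Ignore flat steps, then count how often the direction flips.
    signs = [d > 0 for d in diffs if d != 0]
    changes = sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    if changes == 1:
        # Up then down is a peak; down then up is a valley.
        return "peak" if signs[0] else "valley"
    return "no monotonic"


# Event rates per class for bins 0 to 5 of the first binning table.
event_rates = {
    0: [0.225806, 0.100000, 0.346154, 0.375000, 0.400000, 0.481481],
    1: [0.741935, 0.650000, 0.384615, 0.265625, 0.100000, 0.259259],
    2: [0.032258, 0.250000, 0.269231, 0.359375, 0.500000, 0.259259],
}
trends = {k: classify_trend(v) for k, v in event_rates.items()}
print(trends)
```

This prints {0: 'valley', 1: 'valley', 2: 'peak'}, matching the monotonic trends reported by the analysis of the first binning table.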

A good starting point to decide which monotonic trend to enforce for each class is to use monotonic_trend=None.

[17]:
optb = MulticlassOptimalBinning(name=variable, solver="mip", monotonic_trend=None)

optb.fit(x, y)
[17]:
MulticlassOptimalBinning(monotonic_trend=None, name='ash', solver='mip')
[18]:
binning_table = optb.binning_table
binning_table.build()
[18]:
                 Bin  Count  Count (%)  Event_0  Event_1  Event_2  Event_rate_0  Event_rate_1  Event_rate_2
0       (-inf, 2.15)     31   0.174157        7       23        1      0.225806      0.741935      0.032258
1       [2.15, 2.25)     20   0.112360        2       13        5      0.100000      0.650000      0.250000
2       [2.25, 2.28)     15   0.084270        5        5        5      0.333333      0.333333      0.333333
3       [2.28, 2.31)     11   0.061798        4        5        2      0.363636      0.454545      0.181818
4       [2.31, 2.35)      9   0.050562        1        2        6      0.111111      0.222222      0.666667
5       [2.35, 2.39)     12   0.067416        4        2        6      0.333333      0.166667      0.500000
6       [2.39, 2.47)     19   0.106742       10        6        3      0.526316      0.315789      0.157895
7       [2.47, 2.50)      9   0.050562        3        2        4      0.333333      0.222222      0.444444
8       [2.50, 2.60)     15   0.084270        6        5        4      0.400000      0.333333      0.266667
9       [2.60, 2.65)     10   0.056180        4        1        5      0.400000      0.100000      0.500000
10      [2.65, 2.73)     15   0.084270        9        2        4      0.600000      0.133333      0.266667
11       [2.73, inf)     12   0.067416        4        5        3      0.333333      0.416667      0.250000
12           Special      0   0.000000        0        0        0      0.000000      0.000000      0.000000
13           Missing      0   0.000000        0        0        0      0.000000      0.000000      0.000000
Totals                  178   1.000000       59       71       48      0.331461      0.398876      0.269663
[19]:
binning_table.plot()
../_images/tutorials_tutorial_multiclass_41_0.png
[20]:
binning_table.analysis()
-------------------------------------------------
OptimalBinning: Multiclass Binning Table Analysis
-------------------------------------------------

  General metrics

    JS (Jensen-Shannon)      0.15199204
    HHI                      0.09683121
    HHI (normalized)         0.02735669
    Cramer's V               0.37833165
    Quality score            0.00403065

  Monotonic trend

    Class  0               no monotonic
    Class  1               no monotonic
    Class  2               no monotonic

  Significance tests

    Bin A  Bin B  t-statistic  p-value
        0      1     6.135081 0.046535
        1      2     4.212963 0.121665
        2      3     0.800385 0.670191
        3      4     4.935065 0.084794
        4      5     1.400000 0.496585
        5      6     4.205201 0.122138
        6      7     2.682861 0.261471
        7      8     0.838095 0.657673
        8      9     2.268519 0.321660
        9     10     1.424501 0.490539
       10     11     3.056044 0.216964
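
The HHI values in the two analysis reports can be recovered from the Count column of the corresponding binning tables. The sketch below assumes that HHI is the sum of squared bin shares and that the normalized version rescales it using the number of table rows, including the empty Special and Missing bins; these assumptions reproduce the reported numbers here, but the helper hhi is hypothetical and not the library's code.

```python
def hhi(counts, normalized=False):
    """Herfindahl-Hirschman Index of the bin sizes."""
    total = sum(counts)
    value = sum((c / total) ** 2 for c in counts)
    if normalized:
        # Rescale so that equally sized bins give 0 and a single bin gives 1.
        n_bins = len(counts)
        value = (value - 1 / n_bins) / (1 - 1 / n_bins)
    return value


# Count column of each table, including the Special and Missing rows.
auto_counts = [31, 20, 26, 64, 10, 27, 0, 0]
none_counts = [31, 20, 15, 11, 9, 12, 19, 9, 15, 10, 15, 12, 0, 0]

for counts in (auto_counts, none_counts):
    print(round(hhi(counts), 6), round(hhi(counts, normalized=True), 6))
```

This prints 0.219732 0.108266 for the default binning and 0.096831 0.027357 for monotonic_trend=None. The lower values in the second case reflect that the records are spread over more, smaller bins.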

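For example, we can decide that we do not care about class 2 and only constrain classes 0 and 1: class 0 is forced to be “ascending”, class 1 uses “auto”, and class 2 is left unconstrained (None).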

[21]:
optb = MulticlassOptimalBinning(name=variable, solver="mip",
                                monotonic_trend=["ascending", "auto", None],
                                verbose=True)

optb.fit(x, y)
2024-01-15 00:17:04,003 | INFO : Optimal binning started.
2024-01-15 00:17:04,007 | INFO : Options: check parameters.
2024-01-15 00:17:04,009 | INFO : Pre-processing started.
2024-01-15 00:17:04,010 | INFO : Pre-processing: number of samples: 178
2024-01-15 00:17:04,012 | INFO : Pre-processing: number of clean samples: 178
2024-01-15 00:17:04,013 | INFO : Pre-processing: number of missing samples: 0
2024-01-15 00:17:04,015 | INFO : Pre-processing: number of special samples: 0
2024-01-15 00:17:04,016 | INFO : Pre-processing terminated. Time: 0.0002s
2024-01-15 00:17:04,017 | INFO : Pre-binning started.
2024-01-15 00:17:04,026 | INFO : Pre-binning: number prebins removed: 1
2024-01-15 00:17:04,028 | INFO : Pre-binning: number of prebins: 12
2024-01-15 00:17:04,031 | INFO : Pre-binning: number of refinements: 1
2024-01-15 00:17:04,032 | INFO : Pre-binning terminated. Time: 0.0097s
2024-01-15 00:17:04,034 | INFO : Optimizer started.
2024-01-15 00:17:04,039 | INFO : Optimizer: classifier predicts valley monotonic trend.
2024-01-15 00:17:04,040 | INFO : Optimizer: build model...
2024-01-15 00:17:04,241 | INFO : Optimizer: solve...
2024-01-15 00:17:04,248 | INFO : Optimizer terminated. Time: 0.2129s
2024-01-15 00:17:04,250 | INFO : Post-processing started.
2024-01-15 00:17:04,252 | INFO : Post-processing: compute binning information.
2024-01-15 00:17:04,254 | INFO : Post-processing terminated. Time: 0.0004s
2024-01-15 00:17:04,260 | INFO : Optimal binning terminated. Status: OPTIMAL. Time: 0.2568s
[21]:
MulticlassOptimalBinning(monotonic_trend=['ascending', 'auto', None],
                         name='ash', solver='mip', verbose=True)
[22]:
binning_table = optb.binning_table
binning_table.build()
[22]:
                 Bin  Count  Count (%)  Event_0  Event_1  Event_2  Event_rate_0  Event_rate_1  Event_rate_2
0       (-inf, 2.25)     51   0.286517        9       36        6      0.176471      0.705882      0.117647
1       [2.25, 2.35)     35   0.196629       10       12       13      0.285714      0.342857      0.371429
2       [2.35, 2.39)     12   0.067416        4        2        6      0.333333      0.166667      0.500000
3        [2.39, inf)     80   0.449438       36       21       23      0.450000      0.262500      0.287500
4            Special      0   0.000000        0        0        0      0.000000      0.000000      0.000000
5            Missing      0   0.000000        0        0        0      0.000000      0.000000      0.000000
Totals                  178   1.000000       59       71       48      0.331461      0.398876      0.269663
[23]:
binning_table.plot()
../_images/tutorials_tutorial_multiclass_46_0.png
[24]:
binning_table.analysis()
-------------------------------------------------
OptimalBinning: Multiclass Binning Table Analysis
-------------------------------------------------

  General metrics

    JS (Jensen-Shannon)      0.08899633
    HHI                      0.32729453
    HHI (normalized)         0.19275344
    Cramer's V               0.30518651
    Quality score            0.08792312

  Monotonic trend

    Class  0                  ascending
    Class  1            valley (convex)
    Class  2             peak (concave)

  Significance tests

    Bin A  Bin B  t-statistic  p-value
        0      1    12.072711 0.002390
        1      2     1.364733 0.505419
        2      3     2.204828 0.332069

Event rate minimum difference

Since version 0.17.0, the parameter min_event_rate_diff is available for multiclass binning. It sets the minimum event rate difference between consecutive bins.

[25]:
optb = MulticlassOptimalBinning(name=variable, solver="mip",
                                monotonic_trend=None,
                                min_event_rate_diff=0.03)

optb.fit(x, y)
[25]:
MulticlassOptimalBinning(min_event_rate_diff=0.03, monotonic_trend=None,
                         name='ash', solver='mip')
[26]:
optb.information(print_level=2)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0

  Begin options
    name                                 ash   * U
    prebinning_method                   cart   * d
    solver                               mip   * U
    max_n_prebins                         20   * d
    min_prebin_size                     0.05   * d
    min_n_bins                            no   * d
    max_n_bins                            no   * d
    min_bin_size                          no   * d
    max_bin_size                          no   * d
    monotonic_trend                       no   * U
    min_event_rate_diff                 0.03   * U
    max_pvalue                            no   * d
    max_pvalue_policy            consecutive   * d
    user_splits                           no   * d
    user_splits_fixed                     no   * d
    special_codes                         no   * d
    split_digits                          no   * d
    mip_solver                           bop   * d
    time_limit                           100   * d
    verbose                            False   * d
  End options

  Name    : ash
  Status  : OPTIMAL

  Pre-binning statistics
    Number of pre-bins                    12
    Number of refinements                  1

  Solver statistics
    Type                                 mip
    Number of variables                  196
    Number of constraints                 78
    Objective value                   2.2657
    Best objective bound              2.2657

  Timing
    Total time                          0.05 sec
    Pre-processing                      0.00 sec   (  0.27%)
    Pre-binning                         0.00 sec   (  9.86%)
    Solver                              0.04 sec   ( 88.28%)
    Post-processing                     0.00 sec   (  0.55%)

[27]:
binning_table = optb.binning_table
binning_table.build()
[27]:
                 Bin  Count  Count (%)  Event_0  Event_1  Event_2  Event_rate_0  Event_rate_1  Event_rate_2
0       (-inf, 2.15)     31   0.174157        7       23        1      0.225806      0.741935      0.032258
1       [2.15, 2.25)     20   0.112360        2       13        5      0.100000      0.650000      0.250000
2       [2.25, 2.28)     15   0.084270        5        5        5      0.333333      0.333333      0.333333
3       [2.28, 2.31)     11   0.061798        4        5        2      0.363636      0.454545      0.181818
4       [2.31, 2.35)      9   0.050562        1        2        6      0.111111      0.222222      0.666667
5       [2.35, 2.39)     12   0.067416        4        2        6      0.333333      0.166667      0.500000
6       [2.39, 2.47)     19   0.106742       10        6        3      0.526316      0.315789      0.157895
7       [2.47, 2.50)      9   0.050562        3        2        4      0.333333      0.222222      0.444444
8       [2.50, 2.60)     15   0.084270        6        5        4      0.400000      0.333333      0.266667
9       [2.60, 2.73)     25   0.140449       13        3        9      0.520000      0.120000      0.360000
10       [2.73, inf)     12   0.067416        4        5        3      0.333333      0.416667      0.250000
11           Special      0   0.000000        0        0        0      0.000000      0.000000      0.000000
12           Missing      0   0.000000        0        0        0      0.000000      0.000000      0.000000
Totals                  178   1.000000       59       71       48      0.331461      0.398876      0.269663
[28]:
binning_table.plot()
../_images/tutorials_tutorial_multiclass_53_0.png
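
To see the effect of min_event_rate_diff=0.03, one can compare the smallest gap between consecutive event rates with and without the constraint. The sketch below assumes the parameter bounds the absolute difference between consecutive bins for each class; the helper min_consecutive_diff is hypothetical, and the rates are copied from the two monotonic_trend=None binning tables in this tutorial.

```python
def min_consecutive_diff(rates):
    """Smallest absolute difference between consecutive event rates."""
    return min(abs(b - a) for a, b in zip(rates, rates[1:]))


# Event_rate_0 without the constraint (12 bins): bins 8 and 9 are both 0.4.
class_0_unconstrained = [0.225806, 0.100000, 0.333333, 0.363636, 0.111111,
                         0.333333, 0.526316, 0.333333, 0.400000, 0.400000,
                         0.600000, 0.333333]

# Event rates per class with min_event_rate_diff=0.03 (11 bins).
constrained = {
    0: [0.225806, 0.100000, 0.333333, 0.363636, 0.111111, 0.333333,
        0.526316, 0.333333, 0.400000, 0.520000, 0.333333],
    1: [0.741935, 0.650000, 0.333333, 0.454545, 0.222222, 0.166667,
        0.315789, 0.222222, 0.333333, 0.120000, 0.416667],
    2: [0.032258, 0.250000, 0.333333, 0.181818, 0.666667, 0.500000,
        0.157895, 0.444444, 0.266667, 0.360000, 0.250000],
}

print(min_consecutive_diff(class_0_unconstrained))
gaps = {k: round(min_consecutive_diff(v), 6) for k, v in constrained.items()}
print(gaps)
```

Without the constraint the smallest gap for class 0 is 0.0, because bins [2.50, 2.60) and [2.60, 2.65) share the same event rate. With the constraint, the bins [2.60, 2.65) and [2.65, 2.73) are merged into [2.60, 2.73), and the smallest gap of every class is at least 0.03.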