Tutorial: optimal binning 2D with binary target
As usual, let’s load a well-known dataset from the UCI repository and transform the data into a pandas.DataFrame.
[1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
[2]:
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
We choose two variables to discretize and the binary target.
[3]:
variable1 = "mean radius"
variable2 = "worst concavity"
x = df[variable1].values
y = df[variable2].values
z = data.target
Import and instantiate an OptimalBinning2D object. We pass the variable names (coordinates x and y) and a solver; in this case, we choose the constraint programming solver.
[4]:
from optbinning import OptimalBinning2D
[5]:
optb = OptimalBinning2D(name_x=variable1, name_y=variable2, solver="cp")
We fit the optimal binning object with arrays x, y, and z.
[6]:
optb.fit(x, y, z)
[6]:
OptimalBinning2D(name_x='mean radius', name_y='worst concavity')
Similar to other OptBinning classes, you can inspect the attributes status and splits. In this case, the splits shown are actually the bins, but the splits name is used to maintain API homogeneity.
[7]:
optb.status
[7]:
'OPTIMAL'
[8]:
optb.splits
[8]:
([[-inf, 13.704999923706055],
[13.704999923706055, 15.045000076293945],
[15.045000076293945, 16.925000190734863],
[16.925000190734863, inf],
[-inf, 13.09499979019165],
[13.09499979019165, 13.704999923706055],
[15.045000076293945, 16.925000190734863],
[13.09499979019165, 13.704999923706055],
[13.704999923706055, 15.045000076293945],
[15.045000076293945, 16.925000190734863],
[13.09499979019165, 13.704999923706055],
[13.704999923706055, 15.045000076293945],
[15.045000076293945, inf],
[-inf, 13.09499979019165],
[13.09499979019165, 13.704999923706055],
[13.704999923706055, 15.045000076293945]],
[[-inf, 0.20795000344514847],
[-inf, 0.2604999989271164],
[-inf, 0.20795000344514847],
[-inf, 0.31530000269412994],
[0.20795000344514847, 0.37815000116825104],
[0.20795000344514847, 0.2604999989271164],
[0.20795000344514847, 0.2604999989271164],
[0.2604999989271164, 0.31530000269412994],
[0.2604999989271164, 0.31530000269412994],
[0.2604999989271164, 0.31530000269412994],
[0.31530000269412994, 0.37815000116825104],
[0.31530000269412994, 0.37815000116825104],
[0.31530000269412994, inf],
[0.37815000116825104, inf],
[0.37815000116825104, inf],
[0.37815000116825104, inf]])
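To make the pairing of the two lists concrete, here is a minimal sketch that zips the first three x and y intervals from the output above into rectangles and formats them the way the binning table prints them. The helper `interval_label` is a hypothetical name introduced for this illustration; it is not part of OptBinning.

```python
import math

# First three entries of each list in the optb.splits output above.
splits_x = [
    [-math.inf, 13.704999923706055],
    [13.704999923706055, 15.045000076293945],
    [15.045000076293945, 16.925000190734863],
]
splits_y = [
    [-math.inf, 0.20795000344514847],
    [-math.inf, 0.2604999989271164],
    [-math.inf, 0.20795000344514847],
]


def interval_label(lo, hi):
    """Format a half-open interval with two decimals (hypothetical helper)."""
    left = "(" if math.isinf(lo) else "["
    return f"{left}{lo:.2f}, {hi:.2f})"


# The i-th x interval and the i-th y interval together describe bin i.
rectangles = [
    (interval_label(*bx), interval_label(*by))
    for bx, by in zip(splits_x, splits_y)
]
for bin_id, (label_x, label_y) in enumerate(rectangles):
    print(bin_id, label_x, label_y)
```

Each position in the two lists therefore describes one rectangular bin, which is why the same interval can appear several times.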
The binning table
The binning table follows the same structure as the unidimensional binning, except for having two Bin columns, one for each variable (coordinate). The option show_bin_xy=True in method build combines both columns to obtain a single Bin column.
[9]:
optb.binning_table.build()
[9]:
| | Bin x | Bin y | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 13.70) | (-inf, 0.21) | 219 | 0.384886 | 1 | 218 | 0.995434 | -4.863346 | 2.946834 | 0.199430 |
1 | [13.70, 15.05) | (-inf, 0.26) | 45 | 0.079086 | 1 | 44 | 0.977778 | -3.263040 | 0.386776 | 0.034251 |
2 | [15.05, 16.93) | (-inf, 0.21) | 8 | 0.014060 | 2 | 6 | 0.750000 | -0.577463 | 0.004257 | 0.000525 |
3 | [16.93, inf) | (-inf, 0.32) | 21 | 0.036907 | 20 | 1 | 0.047619 | 3.516882 | 0.321930 | 0.027320 |
4 | (-inf, 13.09) | [0.21, 0.38) | 48 | 0.084359 | 1 | 47 | 0.979167 | -3.328998 | 0.422569 | 0.037010 |
5 | [13.09, 13.70) | [0.21, 0.26) | 6 | 0.010545 | 1 | 5 | 0.833333 | -1.088288 | 0.010109 | 0.001205 |
6 | [15.05, 16.93) | [0.21, 0.26) | 6 | 0.010545 | 4 | 2 | 0.333333 | 1.214297 | 0.016108 | 0.001898 |
7 | [13.09, 13.70) | [0.26, 0.32) | 4 | 0.007030 | 1 | 3 | 0.750000 | -0.577463 | 0.002129 | 0.000262 |
8 | [13.70, 15.05) | [0.26, 0.32) | 9 | 0.015817 | 5 | 4 | 0.444444 | 0.744293 | 0.009215 | 0.001126 |
9 | [15.05, 16.93) | [0.26, 0.32) | 8 | 0.014060 | 7 | 1 | 0.125000 | 2.467060 | 0.074549 | 0.007501 |
10 | [13.09, 13.70) | [0.32, 0.38) | 7 | 0.012302 | 3 | 4 | 0.571429 | 0.233467 | 0.000688 | 0.000086 |
11 | [13.70, 15.05) | [0.32, 0.38) | 12 | 0.021090 | 7 | 5 | 0.416667 | 0.857622 | 0.016306 | 0.001978 |
12 | [15.05, inf) | [0.32, inf) | 129 | 0.226714 | 128 | 1 | 0.007752 | 5.373180 | 3.229133 | 0.201294 |
13 | (-inf, 13.09) | [0.38, inf) | 22 | 0.038664 | 11 | 11 | 0.500000 | 0.521150 | 0.010983 | 0.001358 |
14 | [13.09, 13.70) | [0.38, inf) | 8 | 0.014060 | 5 | 3 | 0.375000 | 1.031975 | 0.015667 | 0.001876 |
15 | [13.70, 15.05) | [0.38, inf) | 17 | 0.029877 | 15 | 2 | 0.117647 | 2.536053 | 0.165230 | 0.016450 |
16 | Special | Special | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
17 | Missing | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | | 569 | 1.000000 | 212 | 357 | 0.627417 | | 7.632482 | 0.533569 |
[10]:
optb.binning_table.build(show_bin_xy=True)
[10]:
| | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 13.70) $\cup$ (-inf, 0.21) | 219 | 0.384886 | 1 | 218 | 0.995434 | -4.863346 | 2.946834 | 0.199430 |
1 | [13.70, 15.05) $\cup$ (-inf, 0.26) | 45 | 0.079086 | 1 | 44 | 0.977778 | -3.263040 | 0.386776 | 0.034251 |
2 | [15.05, 16.93) $\cup$ (-inf, 0.21) | 8 | 0.014060 | 2 | 6 | 0.750000 | -0.577463 | 0.004257 | 0.000525 |
3 | [16.93, inf) $\cup$ (-inf, 0.32) | 21 | 0.036907 | 20 | 1 | 0.047619 | 3.516882 | 0.321930 | 0.027320 |
4 | (-inf, 13.09) $\cup$ [0.21, 0.38) | 48 | 0.084359 | 1 | 47 | 0.979167 | -3.328998 | 0.422569 | 0.037010 |
5 | [13.09, 13.70) $\cup$ [0.21, 0.26) | 6 | 0.010545 | 1 | 5 | 0.833333 | -1.088288 | 0.010109 | 0.001205 |
6 | [15.05, 16.93) $\cup$ [0.21, 0.26) | 6 | 0.010545 | 4 | 2 | 0.333333 | 1.214297 | 0.016108 | 0.001898 |
7 | [13.09, 13.70) $\cup$ [0.26, 0.32) | 4 | 0.007030 | 1 | 3 | 0.750000 | -0.577463 | 0.002129 | 0.000262 |
8 | [13.70, 15.05) $\cup$ [0.26, 0.32) | 9 | 0.015817 | 5 | 4 | 0.444444 | 0.744293 | 0.009215 | 0.001126 |
9 | [15.05, 16.93) $\cup$ [0.26, 0.32) | 8 | 0.014060 | 7 | 1 | 0.125000 | 2.467060 | 0.074549 | 0.007501 |
10 | [13.09, 13.70) $\cup$ [0.32, 0.38) | 7 | 0.012302 | 3 | 4 | 0.571429 | 0.233467 | 0.000688 | 0.000086 |
11 | [13.70, 15.05) $\cup$ [0.32, 0.38) | 12 | 0.021090 | 7 | 5 | 0.416667 | 0.857622 | 0.016306 | 0.001978 |
12 | [15.05, inf) $\cup$ [0.32, inf) | 129 | 0.226714 | 128 | 1 | 0.007752 | 5.373180 | 3.229133 | 0.201294 |
13 | (-inf, 13.09) $\cup$ [0.38, inf) | 22 | 0.038664 | 11 | 11 | 0.500000 | 0.521150 | 0.010983 | 0.001358 |
14 | [13.09, 13.70) $\cup$ [0.38, inf) | 8 | 0.014060 | 5 | 3 | 0.375000 | 1.031975 | 0.015667 | 0.001876 |
15 | [13.70, 15.05) $\cup$ [0.38, inf) | 17 | 0.029877 | 15 | 2 | 0.117647 | 2.536053 | 0.165230 | 0.016450 |
16 | Special | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
17 | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | 569 | 1.000000 | 212 | 357 | 0.627417 | | 7.632482 | 0.533569 |
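The Event rate, WoE and IV columns can be checked by hand from the counts. The sketch below assumes the usual definitions, WoE = ln((non-event share) / (event share)) and IV = (non-event share - event share) * WoE, which reproduce the figures printed for bin 0. The function name `woe_iv` is ours and is only meant as an illustration.

```python
import math

# Counts copied from the binning table above.
total_non_event, total_event = 212, 357
bin_non_event, bin_event = 1, 218  # bin 0


def woe_iv(non_event, event, total_non_event, total_event):
    """Return (event rate, WoE, IV contribution) for a single bin."""
    p_non_event = non_event / total_non_event  # share of all non-events
    p_event = event / total_event              # share of all events
    woe = math.log(p_non_event / p_event)
    iv = (p_non_event - p_event) * woe
    event_rate = event / (non_event + event)
    return event_rate, woe, iv


event_rate, woe, iv = woe_iv(bin_non_event, bin_event,
                             total_non_event, total_event)
print(f"{event_rate:.6f} {woe:.6f} {iv:.6f}")
```

With this sign convention, bins dominated by events get a negative WoE, which matches the table.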
You can use the method plot to visualize the 2D histogram and the WoE or event rate curve. Note that the Bin ID corresponds to the binning table index. These are the key points to correctly interpret the plots below:
Bins can only be rectangles. If a bin is composed of \(m\) squares, the Bin ID is shown \(m\) times.
The upper left plot shows the WoE/event rate on the x-axis; the bin paths run left to right.
The lower right plot shows the WoE/event rate on the y-axis; the bin paths run top-down.
[11]:
optb.binning_table.plot(metric="woe")
[12]:
optb.binning_table.plot(metric="event_rate")
Event rate / WoE transformation
Now that we have checked the binned data, we can transform our original data into WoE or event rate values. You can check the correctness of the transformation using the pandas value_counts method, for instance. Note that both x and y are required, and a single array is returned with the transformation.
[13]:
z_transform_woe = optb.transform(x, y, metric="woe")
pd.Series(z_transform_woe).value_counts()
[13]:
-4.863346 219
5.373180 129
-3.328998 48
-3.263040 45
0.521150 22
3.516882 21
2.536053 17
0.857622 12
-0.577463 12
0.744293 9
2.467060 8
1.031975 8
0.233467 7
-1.088288 6
1.214297 6
dtype: int64
[14]:
z_transform_event_rate = optb.transform(x, y, metric="event_rate")
pd.Series(z_transform_event_rate).value_counts()
[14]:
0.995434 219
0.007752 129
0.979167 48
0.977778 45
0.500000 22
0.047619 21
0.117647 17
0.416667 12
0.750000 12
0.444444 9
0.125000 8
0.375000 8
0.571429 7
0.833333 6
0.333333 6
dtype: int64
[15]:
z_transform_indices = optb.transform(x, y, metric="indices")
pd.Series(z_transform_indices).value_counts()
[15]:
0 219
12 129
4 48
1 45
13 22
3 21
15 17
11 12
8 9
2 8
9 8
14 8
10 7
5 6
6 6
7 4
dtype: int64
If metric="bins", the bins of both variables are combined into a single label.
[16]:
z_transform_bins = optb.transform(x, y, metric="bins")
[17]:
pd.Series(z_transform_bins).value_counts()
[17]:
(-inf, 13.70) $\cup$ (-inf, 0.21) 219
[15.05, inf) $\cup$ [0.32, inf) 129
(-inf, 13.09) $\cup$ [0.21, 0.38) 48
[13.70, 15.05) $\cup$ (-inf, 0.26) 45
(-inf, 13.09) $\cup$ [0.38, inf) 22
[16.93, inf) $\cup$ (-inf, 0.32) 21
[13.70, 15.05) $\cup$ [0.38, inf) 17
[13.70, 15.05) $\cup$ [0.32, 0.38) 12
[13.70, 15.05) $\cup$ [0.26, 0.32) 9
[15.05, 16.93) $\cup$ [0.26, 0.32) 8
[13.09, 13.70) $\cup$ [0.38, inf) 8
[15.05, 16.93) $\cup$ (-inf, 0.21) 8
[13.09, 13.70) $\cup$ [0.32, 0.38) 7
[15.05, 16.93) $\cup$ [0.21, 0.26) 6
[13.09, 13.70) $\cup$ [0.21, 0.26) 6
[13.09, 13.70) $\cup$ [0.26, 0.32) 4
dtype: int64
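Conceptually, the transformation assigns each point (x, y) to the rectangle that contains it and returns that bin's value. The following is a minimal sketch of this lookup for four of the bins above, assuming half-open intervals [low, high) as printed in the table. It is not the library's implementation, and the names `BINS` and `lookup` are hypothetical.

```python
import math

# (x_low, x_high, y_low, y_high, WoE) for bins 0, 1, 3 and 12, taken from
# the optb.splits output and the binning table. The remaining bins are
# left out to keep the sketch short.
BINS = {
    0: (-math.inf, 13.704999923706055, -math.inf, 0.20795000344514847, -4.863346),
    1: (13.704999923706055, 15.045000076293945, -math.inf, 0.2604999989271164, -3.263040),
    3: (16.925000190734863, math.inf, -math.inf, 0.31530000269412994, 3.516882),
    12: (15.045000076293945, math.inf, 0.31530000269412994, math.inf, 5.373180),
}


def lookup(x_value, y_value):
    """Return (bin index, WoE) of the rectangle containing the point, or None."""
    for index, (x_lo, x_hi, y_lo, y_hi, woe) in BINS.items():
        if x_lo <= x_value < x_hi and y_lo <= y_value < y_hi:
            return index, woe
    return None  # the point falls in a bin omitted from this sketch


print(lookup(12.0, 0.10))
print(lookup(20.0, 0.50))
```

Returning the index instead of the WoE corresponds to metric="indices".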
Binning table statistical analysis
The analysis method performs a statistical analysis of the binning table, computing the Gini index, Information Value (IV), Jensen-Shannon divergence, and the quality score. The report is the same as the one for unidimensional binning with a binary target. The main difference is that the significance tests for each bin are performed with respect to all its linked bins.
[18]:
optb.binning_table.analysis()
------------------------------------------------
OptimalBinning: Binary Binning Table 2D Analysis
------------------------------------------------
General metrics
Gini index 0.96381005
IV (Jeffrey) 7.63248244
JS (Jensen-Shannon) 0.53356918
Hellinger 0.66868014
Triangular 1.62726969
KS 0.77651815
HHI 0.21836787
HHI (normalized) 0.17238951
Cramer's V 0.89619441
Quality score 0.00000000
Significance tests
Bin A Bin B t-statistic p-value P[A > B] P[B > A]
0 1 1.547799 2.134607e-01 0.832082 1.679183e-01
0 4 1.401336 2.365000e-01 0.822661 1.773392e-01
0 5 17.418530 2.998882e-05 0.977759 2.224079e-02
1 2 6.599481 1.020085e-02 0.983958 1.604164e-02
1 6 24.864348 6.150953e-07 0.999999 1.269024e-06
1 8 21.600000 3.358518e-06 0.999997 3.065628e-06
2 3 15.607452 7.794679e-05 0.999984 1.596851e-05
2 6 2.430556 1.189907e-01 0.954856 4.514395e-02
3 12 2.181916 1.396405e-01 0.865053 1.349470e-01
4 5 3.180288 7.453157e-02 0.903885 9.611469e-02
4 7 5.243333 2.203102e-02 0.940065 5.993476e-02
4 10 15.060326 1.041290e-04 0.999472 5.281334e-04
4 13 24.385177 7.887324e-07 1.000000 2.478198e-08
5 1 2.931633 8.685961e-02 0.101859 8.981415e-01
5 7 0.104167 7.468856e-01 0.625007 3.749931e-01
6 3 3.857143 4.953461e-02 0.966923 3.307725e-02
6 9 0.883838 3.471525e-01 0.848618 1.513821e-01
7 8 1.040344 3.077415e-01 0.878999 1.210008e-01
7 10 0.350765 5.536802e-01 0.762036 2.379644e-01
8 9 2.081713 1.490728e-01 0.948990 5.101023e-02
8 11 0.016204 8.987079e-01 0.550579 4.494212e-01
9 3 0.540234 4.623359e-01 0.740855 2.591449e-01
9 12 7.198597 7.296061e-03 0.948279 5.172121e-02
10 11 0.424735 5.145836e-01 0.753507 2.464934e-01
10 14 0.578763 4.467977e-01 0.791467 2.085325e-01
11 12 45.057860 1.912979e-11 0.999999 8.195706e-07
11 15 3.434842 6.383469e-02 0.973833 2.616650e-02
13 14 0.368304 5.439304e-01 0.741788 2.582125e-01
14 15 2.251838 1.334558e-01 0.933081 6.691866e-02
15 12 9.013449 2.680002e-03 0.988496 1.150434e-02
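As a sanity check on the significance tests, the first row (bins 0 and 1) can be reproduced with a Pearson chi-square test on the 2x2 table of non-event and event counts, without continuity correction. That the report uses exactly this test is an assumption based on the numbers matching, not something stated above; `chi2_2x2` is a hypothetical helper.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# (non-event, event) counts of bins 0 and 1 from the first binning table.
stat, p_value = chi2_2x2(1, 218, 1, 44)
print(f"{stat:.6f} {p_value:.6e}")
```

A large p-value, as in this row, indicates that the event rates of the two linked bins are not clearly distinguishable.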
The OptimalBinning2D object can print overview information about the option settings, problem statistics, and the solution of the computation. Use print_level=2 to include the list of all options.
[19]:
optb.information(print_level=2)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Begin options
name_x mean radius * U
name_y worst concavity * U
dtype_x numerical * d
dtype_y numerical * d
prebinning_method cart * d
strategy grid * d
solver cp * d
divergence iv * d
max_n_prebins_x 5 * d
max_n_prebins_y 5 * d
min_prebin_size_x 0.05 * d
min_prebin_size_y 0.05 * d
min_n_bins no * d
max_n_bins no * d
min_bin_size no * d
max_bin_size no * d
min_bin_n_nonevent no * d
max_bin_n_nonevent no * d
min_bin_n_event no * d
max_bin_n_event no * d
monotonic_trend_x no * d
monotonic_trend_y no * d
min_event_rate_diff_x 0 * d
min_event_rate_diff_y 0 * d
gamma 0 * d
special_codes_x no * d
special_codes_y no * d
split_digits no * d
n_jobs 1 * d
time_limit 100 * d
verbose False * d
End options
Name : mean radius-worst concavity
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 25
Number of refinements 17
Solver statistics
Type cp
Number of booleans 10
Number of branches 24
Number of conflicts 0
Objective value 7632473
Best objective bound 7632473
Timing
Total time 0.06 sec
Pre-processing 0.00 sec ( 0.55%)
Pre-binning 0.00 sec ( 4.77%)
Solver 0.05 sec ( 90.56%)
model generation 0.05 sec ( 84.84%)
optimizer 0.01 sec ( 15.16%)
Post-processing 0.00 sec ( 3.21%)
Event rate / WoE monotonicity
The monotonic_trend_x and monotonic_trend_y options force a monotonic trend on the event rate curve along each axis. By default, both options are set to None. There are two options available: “ascending” and “descending”. In this example, we force both trends to be “descending” and set a minimum bin size of 0.025 (2.5%).
[20]:
optb = OptimalBinning2D(name_x=variable1, name_y=variable2, monotonic_trend_x="descending",
monotonic_trend_y="descending", min_bin_size=0.025)
optb.fit(x, y, z)
[20]:
OptimalBinning2D(min_bin_size=0.025, monotonic_trend_x='descending',
monotonic_trend_y='descending', name_x='mean radius',
name_y='worst concavity')
[21]:
optb.binning_table.build()
[21]:
| | Bin x | Bin y | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 13.70) | (-inf, 0.21) | 219 | 0.384886 | 1 | 218 | 0.995434 | -4.863346 | 2.946834 | 0.199430 |
1 | [13.70, 15.05) | (-inf, 0.21) | 37 | 0.065026 | 1 | 36 | 0.972973 | -3.062369 | 0.294365 | 0.026948 |
2 | [15.05, 16.93) | (-inf, 0.32) | 22 | 0.038664 | 13 | 9 | 0.409091 | 0.888874 | 0.032098 | 0.003885 |
3 | [16.93, inf) | (-inf, 0.32) | 21 | 0.036907 | 20 | 1 | 0.047619 | 3.516882 | 0.321930 | 0.027320 |
4 | (-inf, 13.09) | [0.21, 0.38) | 48 | 0.084359 | 1 | 47 | 0.979167 | -3.328998 | 0.422569 | 0.037010 |
5 | [13.09, 15.05) | [0.21, 0.32) | 27 | 0.047452 | 7 | 20 | 0.740741 | -0.528673 | 0.012161 | 0.001503 |
6 | [13.09, 15.05) | [0.32, 0.38) | 19 | 0.033392 | 10 | 9 | 0.473684 | 0.626510 | 0.013758 | 0.001692 |
7 | [15.05, inf) | [0.32, inf) | 129 | 0.226714 | 128 | 1 | 0.007752 | 5.373180 | 3.229133 | 0.201294 |
8 | (-inf, 13.70) | [0.38, inf) | 30 | 0.052724 | 16 | 14 | 0.466667 | 0.654681 | 0.023736 | 0.002915 |
9 | [13.70, 15.05) | [0.38, inf) | 17 | 0.029877 | 15 | 2 | 0.117647 | 2.536053 | 0.165230 | 0.016450 |
10 | Special | Special | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
11 | Missing | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | | 569 | 1.000000 | 212 | 357 | 0.627417 | | 7.461814 | 0.518447 |
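The descending trend can be verified on the table by following a path of adjacent bins along each axis. The sketch below does this for two paths read off the table: the bottom row (y below 0.21) moving right along x, and the left column (x below 13.09) moving up along y. The helper `is_descending` is only an illustration.

```python
def is_descending(values):
    """True when each value is less than or equal to the previous one."""
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


# Event rates from the table above.
# Bottom row (y < 0.21), moving right along x: bins 0, 1, 2, 3.
along_x = [0.995434, 0.972973, 0.409091, 0.047619]
# Left column (x < 13.09), moving up along y: bins 0, 4, 8.
along_y = [0.995434, 0.979167, 0.466667]

print(is_descending(along_x), is_descending(along_y))
```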
[22]:
optb.binning_table.plot(metric="event_rate")
[23]:
optb.binning_table.analysis()
------------------------------------------------
OptimalBinning: Binary Binning Table 2D Analysis
------------------------------------------------
General metrics
Gini index 0.95655621
IV (Jeffrey) 7.46181417
JS (Jensen-Shannon) 0.51844733
Hellinger 0.65109672
Triangular 1.57873476
KS 0.72434068
HHI 0.22077705
HHI (normalized) 0.14993861
Cramer's V 0.88274577
Quality score 0.00000000
Significance tests
Bin A Bin B t-statistic p-value P[A > B] P[B > A]
0 1 2.060028 1.512074e-01 0.858293 1.417071e-01
0 4 1.401336 2.365000e-01 0.822661 1.773392e-01
0 5 49.557596 1.926326e-12 1.000000 3.782008e-11
1 2 24.238873 8.509727e-07 1.000000 9.439925e-09
1 5 7.696840 5.531760e-03 0.998790 1.209527e-03
2 3 7.865847 5.037723e-03 0.999413 5.874914e-04
2 7 48.954531 2.619654e-12 1.000000 6.888954e-10
3 7 2.181916 1.396405e-01 0.865053 1.349470e-01
4 5 10.308808 1.323968e-03 0.999676 3.238203e-04
4 6 25.345514 4.792657e-07 1.000000 1.762382e-08
4 8 28.448939 9.620246e-08 1.000000 4.164320e-10
5 2 5.519682 1.880367e-02 0.992461 7.538668e-03
5 6 3.413773 6.465445e-02 0.970465 2.953487e-02
6 7 57.065224 4.215951e-14 1.000000 1.266293e-10
6 8 0.002300 9.617489e-01 0.518436 4.815645e-01
6 9 5.359977 2.060404e-02 0.994187 5.812630e-03
8 9 5.886891 1.525401e-02 0.996741 3.258617e-03
9 7 9.013449 2.680002e-03 0.988496 1.150434e-02
Reduction of dominating bins
To produce more homogeneous bins, the formulation includes a constraint to reduce the difference between the largest and smallest bin. The added regularization parameter gamma controls the importance of the reduction term. Larger values specify stronger regularization.
[24]:
optb = OptimalBinning2D(name_x=variable1, name_y=variable2, gamma=600)
optb.fit(x, y, z)
[24]:
OptimalBinning2D(gamma=600, name_x='mean radius', name_y='worst concavity')
[25]:
optb.binning_table.build()
[25]:
| | Bin x | Bin y | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 13.09) | (-inf, 0.21) | 195 | 0.342707 | 1 | 194 | 0.994872 | -4.746709 | 2.557054 | 0.176405 |
1 | [13.09, 16.93) | (-inf, 0.26) | 89 | 0.156415 | 8 | 81 | 0.910112 | -1.793858 | 0.339317 | 0.037510 |
2 | [16.93, inf) | (-inf, inf) | 118 | 0.207381 | 117 | 1 | 0.008475 | 5.283323 | 2.900997 | 0.183436 |
3 | (-inf, 13.09) | [0.21, inf) | 70 | 0.123023 | 12 | 58 | 0.828571 | -1.054387 | 0.111619 | 0.013340 |
4 | [13.09, 16.93) | [0.26, inf) | 97 | 0.170475 | 74 | 23 | 0.237113 | 1.689720 | 0.480947 | 0.053853 |
5 | Special | Special | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
6 | Missing | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | | 569 | 1.000000 | 212 | 357 | 0.627417 | | 6.389933 | 0.464545 |
[26]:
optb.binning_table.plot(metric="event_rate")
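To quantify the effect of gamma, we can compare the gap between the largest and the smallest bin in the Count (%) column of the first binning table of this tutorial (gamma=0) and of the table above (gamma=600). This is a small arithmetic check on the printed figures, not a reproduction of the solver's objective; `size_spread` is a hypothetical helper.

```python
# Count (%) columns of the two binning tables, excluding the empty
# Special and Missing rows.
shares_default = [
    0.384886, 0.079086, 0.014060, 0.036907, 0.084359, 0.010545, 0.010545,
    0.007030, 0.015817, 0.014060, 0.012302, 0.021090, 0.226714, 0.038664,
    0.014060, 0.029877,
]
shares_gamma = [0.342707, 0.156415, 0.207381, 0.123023, 0.170475]


def size_spread(shares):
    """Difference between the largest and the smallest bin share."""
    return max(shares) - min(shares)


print(f"gamma=0:   {size_spread(shares_default):.6f}")
print(f"gamma=600: {size_spread(shares_gamma):.6f}")
```

The gap shrinks with gamma=600, at the price of a lower total IV (6.389933 versus 7.632482).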
Missing data and special codes
For this example, let’s load data from the FICO Explainable Machine Learning Challenge: https://community.fico.com/s/explainable-machine-learning-challenge
[27]:
df = pd.read_csv("data/FICO_challenge/heloc_dataset_v1.csv", sep=",")
The data dictionary of this challenge includes three special values/codes:
-9 No Bureau Record or No Investigation
-8 No Usable/Valid Trades or Inquiries
-7 Condition not Met (e.g. No Inquiries, No Delinquencies)
All three special codes are considered for both variables.
[28]:
special_codes_x = [-9, -8, -7]
special_codes_y = [-9, -8, -7]
[29]:
variable1 = "AverageMInFile"
variable2 = "MSinceOldestTradeOpen"
x = df[variable1].values
y = df[variable2].values
z = df.RiskPerformance.values
mask = z == "Bad"
z[mask] = 1
z[~mask] = 0
z = z.astype(int)
For the sake of completeness, we include a few missing values.
[30]:
idx = np.random.randint(0, len(x), 500)
x = x.astype(float)
x[idx] = np.nan
[31]:
optb = OptimalBinning2D(name_x=variable1, name_y=variable2,
monotonic_trend_y="ascending",
special_codes_x=special_codes_x,
special_codes_y=special_codes_y)
optb.fit(x, y, z)
[31]:
OptimalBinning2D(monotonic_trend_y='ascending', name_x='AverageMInFile',
name_y='MSinceOldestTradeOpen', special_codes_x=[-9, -8, -7],
special_codes_y=[-9, -8, -7])
[32]:
optb.binning_table.build()
[32]:
| | Bin x | Bin y | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 48.50) | (-inf, inf) | 1556 | 0.148289 | 374 | 1182 | 0.759640 | -1.063165 | 0.150230 | 0.017941 |
1 | [48.50, 64.50) | (-inf, 184.50) | 1215 | 0.115791 | 480 | 735 | 0.604938 | -0.338542 | 0.013050 | 0.001623 |
2 | [64.50, 81.50) | (-inf, inf) | 2193 | 0.208996 | 1086 | 1107 | 0.504788 | 0.068390 | 0.000979 | 0.000122 |
3 | [81.50, 101.50) | (-inf, inf) | 1957 | 0.186505 | 1118 | 839 | 0.428717 | 0.374629 | 0.026085 | 0.003242 |
4 | [101.50, inf) | (-inf, inf) | 1898 | 0.180882 | 1205 | 693 | 0.365121 | 0.640748 | 0.072809 | 0.008949 |
5 | [48.50, 64.50) | [184.50, inf) | 360 | 0.034309 | 140 | 220 | 0.611111 | -0.364442 | 0.004472 | 0.000556 |
6 | Special | Special | 827 | 0.078814 | 375 | 452 | 0.546554 | -0.099213 | 0.000773 | 0.000097 |
7 | Missing | Missing | 487 | 0.046412 | 239 | 248 | 0.509240 | 0.050578 | 0.000119 | 0.000015 |
Totals | | | 10493 | 1.000000 | 5017 | 5476 | 0.521872 | | 0.268516 | 0.032545 |
[33]:
optb.binning_table.plot(metric="event_rate")
Note that the special and missing bins are not included in the plot above.
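The idea behind the Special and Missing rows can be sketched as a simple classification of each record before binning. The rule used here is an assumption made for illustration only: a record is set aside as soon as either coordinate is NaN or carries a special code, and NaN is checked first. The library's exact precedence is not described above, and `classify` is a hypothetical function.

```python
import math

SPECIAL_CODES = {-9, -8, -7}


def classify(x_value, y_value, special_codes=SPECIAL_CODES):
    """Label one (x, y) record as 'missing', 'special' or 'clean'.

    Assumption for this sketch: NaN in either coordinate wins over a
    special code in either coordinate.
    """
    if math.isnan(x_value) or math.isnan(y_value):
        return "missing"
    if x_value in special_codes or y_value in special_codes:
        return "special"
    return "clean"


records = [(55.0, 120.0), (-9.0, -9.0), (math.nan, 200.0), (80.0, -8.0)]
labels = [classify(x_value, y_value) for x_value, y_value in records]
print(labels)
```

Only the records labelled clean take part in the rectangular bins; the others are reported in their own rows.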
Strategy CART (Experimental)
In this last section, we provide guidance on handling large grids. These large grids are generated when the parameters max_n_prebins_* increase (the default value is 5). The performance of the optimization solvers CP and MIP is instance dependent. Based on experiments, the CP solver tends to perform better on very large grids, but on small and medium sizes the MIP solver is often faster.
10 prebins (100 grid elements)
[34]:
variable1 = "ExternalRiskEstimate"
variable2 = "AverageMInFile"
x = df[variable1].values
y = df[variable2].values
[35]:
optb = OptimalBinning2D(name_x=variable1, name_y=variable2,
solver="cp",
monotonic_trend_x="descending",
monotonic_trend_y="descending",
max_n_prebins_x=10, max_n_prebins_y=10,
min_bin_size=0.05,
special_codes_x=special_codes_x,
special_codes_y=special_codes_y)
optb.fit(x, y, z)
[35]:
OptimalBinning2D(max_n_prebins_x=10, max_n_prebins_y=10, min_bin_size=0.05,
monotonic_trend_x='descending', monotonic_trend_y='descending',
name_x='ExternalRiskEstimate', name_y='AverageMInFile',
special_codes_x=[-9, -8, -7], special_codes_y=[-9, -8, -7])
[36]:
optb.binning_table.build()
[36]:
| | Bin x | Bin y | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 63.50) | (-inf, 54.50) | 749 | 0.071613 | 87 | 662 | 0.883845 | -1.941530 | 0.201662 | 0.021871 |
1 | [63.50, 70.50) | (-inf, 54.50) | 765 | 0.073143 | 187 | 578 | 0.755556 | -1.040638 | 0.071263 | 0.008527 |
2 | [70.50, 78.50) | (-inf, 64.50) | 815 | 0.077923 | 343 | 472 | 0.579141 | -0.231421 | 0.004134 | 0.000516 |
3 | [78.50, 80.50) | (-inf, inf) | 588 | 0.056220 | 405 | 183 | 0.311224 | 0.882229 | 0.041886 | 0.005072 |
4 | [80.50, inf) | (-inf, 74.50) | 563 | 0.053829 | 405 | 158 | 0.280639 | 1.029120 | 0.053573 | 0.006416 |
5 | (-inf, 59.50) | [54.50, inf) | 746 | 0.071326 | 131 | 615 | 0.824397 | -1.458597 | 0.126107 | 0.014500 |
6 | [59.50, 67.50) | [54.50, 81.50) | 828 | 0.079166 | 213 | 615 | 0.742754 | -0.972502 | 0.068132 | 0.008196 |
7 | [67.50, 70.50) | [54.50, inf) | 735 | 0.070274 | 309 | 426 | 0.579592 | -0.233270 | 0.003787 | 0.000472 |
8 | [70.50, 75.50) | [64.50, inf) | 1049 | 0.100296 | 596 | 453 | 0.431840 | 0.362176 | 0.013117 | 0.001631 |
9 | [75.50, 78.50) | [64.50, inf) | 549 | 0.052491 | 372 | 177 | 0.322404 | 0.830572 | 0.034864 | 0.004237 |
10 | [80.50, 84.50) | [74.50, inf) | 668 | 0.063868 | 528 | 140 | 0.209581 | 1.415282 | 0.113158 | 0.013071 |
11 | [84.50, inf) | [74.50, inf) | 1080 | 0.103260 | 917 | 163 | 0.150926 | 1.815185 | 0.278705 | 0.030726 |
12 | [59.50, 67.50) | [81.50, inf) | 726 | 0.069414 | 240 | 486 | 0.669421 | -0.617742 | 0.025344 | 0.003119 |
13 | Special | Special | 598 | 0.057176 | 267 | 331 | 0.553512 | -0.127042 | 0.000919 | 0.000115 |
14 | Missing | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | | 10459 | 1.000000 | 5000 | 5459 | 0.521943 | | 1.036652 | 0.118468 |
[37]:
optb.binning_table.plot(metric="event_rate")
[38]:
optb.information(print_level=1)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : ExternalRiskEstimate-AverageMInFile
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 100
Number of refinements 844
Solver statistics
Type cp
Number of booleans 2154
Number of branches 19433
Number of conflicts 8439
Objective value 1098271
Best objective bound 1098271
Timing
Total time 11.72 sec
Pre-processing 0.00 sec ( 0.04%)
Pre-binning 0.02 sec ( 0.14%)
Solver 11.69 sec ( 99.75%)
model generation 2.96 sec ( 25.30%)
optimizer 8.73 sec ( 74.70%)
Post-processing 0.00 sec ( 0.02%)
[39]:
optb = OptimalBinning2D(name_x=variable1, name_y=variable2,
solver="mip",
monotonic_trend_x="descending",
monotonic_trend_y="descending",
max_n_prebins_x=10, max_n_prebins_y=10,
min_bin_size=0.05,
special_codes_x=special_codes_x,
special_codes_y=special_codes_y)
optb.fit(x, y, z)
[39]:
OptimalBinning2D(max_n_prebins_x=10, max_n_prebins_y=10, min_bin_size=0.05,
monotonic_trend_x='descending', monotonic_trend_y='descending',
name_x='ExternalRiskEstimate', name_y='AverageMInFile',
solver='mip', special_codes_x=[-9, -8, -7],
special_codes_y=[-9, -8, -7])
[40]:
optb.information(print_level=1)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : ExternalRiskEstimate-AverageMInFile
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 100
Number of refinements 844
Solver statistics
Type mip
Number of variables 1596
Number of constraints 2181
Objective value 1.0983
Best objective bound 1.0983
Timing
Total time 5.78 sec
Pre-processing 0.00 sec ( 0.06%)
Pre-binning 0.02 sec ( 0.28%)
Solver 5.75 sec ( 99.53%)
Post-processing 0.00 sec ( 0.03%)
In this case, the MIP solver roughly halves the total time (5.78 s versus 11.72 s). The default strategy to perform refinements is strategy="grid". Alternatively, when setting strategy="cart", a decision tree is used to reduce the search space by merging non-relevant pre-bins. This procedure accelerates the solution of the optimization problem at the expense of worsening the total IV.
[41]:
optb = OptimalBinning2D(name_x=variable1, name_y=variable2,
solver="cp", strategy="cart",
monotonic_trend_x="descending",
monotonic_trend_y="descending",
max_n_prebins_x=10, max_n_prebins_y=10,
min_bin_size=0.05,
special_codes_x=special_codes_x,
special_codes_y=special_codes_y)
optb.fit(x, y, z)
[41]:
OptimalBinning2D(max_n_prebins_x=10, max_n_prebins_y=10, min_bin_size=0.05,
monotonic_trend_x='descending', monotonic_trend_y='descending',
name_x='ExternalRiskEstimate', name_y='AverageMInFile',
special_codes_x=[-9, -8, -7], special_codes_y=[-9, -8, -7],
strategy='cart')
[42]:
optb.binning_table.build()
[42]:
| | Bin x | Bin y | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 63.50) | (-inf, 54.50) | 749 | 0.071613 | 87 | 662 | 0.883845 | -1.941530 | 0.201662 | 0.021871 |
1 | [63.50, 73.50) | (-inf, 48.50) | 768 | 0.073430 | 181 | 587 | 0.764323 | -1.088700 | 0.077656 | 0.009254 |
2 | [73.50, 80.50) | (-inf, 64.50) | 620 | 0.059279 | 324 | 296 | 0.477419 | 0.178212 | 0.001885 | 0.000235 |
3 | [80.50, inf) | (-inf, 74.50) | 563 | 0.053829 | 405 | 158 | 0.280639 | 1.029120 | 0.053573 | 0.006416 |
4 | [63.50, 73.50) | [48.50, 64.50) | 661 | 0.063199 | 235 | 426 | 0.644478 | -0.507026 | 0.015736 | 0.001946 |
5 | (-inf, 59.50) | [54.50, inf) | 746 | 0.071326 | 131 | 615 | 0.824397 | -1.458597 | 0.126107 | 0.014500 |
6 | [59.50, 63.50) | [54.50, inf) | 683 | 0.065303 | 176 | 507 | 0.742313 | -0.970199 | 0.055955 | 0.006732 |
7 | [63.50, 70.50) | [64.50, inf) | 1314 | 0.125633 | 488 | 826 | 0.628615 | -0.438452 | 0.023549 | 0.002920 |
8 | [70.50, 75.50) | [64.50, inf) | 1049 | 0.100296 | 596 | 453 | 0.431840 | 0.362176 | 0.013117 | 0.001631 |
9 | [75.50, 80.50) | [64.50, inf) | 960 | 0.091787 | 665 | 295 | 0.307292 | 0.900639 | 0.071115 | 0.008601 |
10 | [80.50, 84.50) | [74.50, inf) | 668 | 0.063868 | 528 | 140 | 0.209581 | 1.415282 | 0.113158 | 0.013071 |
11 | [84.50, inf) | [74.50, inf) | 1080 | 0.103260 | 917 | 163 | 0.150926 | 1.815185 | 0.278705 | 0.030726 |
12 | Special | Special | 598 | 0.057176 | 267 | 331 | 0.553512 | -0.127042 | 0.000919 | 0.000115 |
13 | Missing | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | | 10459 | 1.000000 | 5000 | 5459 | 0.521943 | | 1.033139 | 0.118019 |
[43]:
optb.binning_table.plot(metric="event_rate")
[44]:
optb.information(print_level=1)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : ExternalRiskEstimate-AverageMInFile
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 100
Number of refinements 2802
Solver statistics
Type cp
Number of booleans 145
Number of branches 1962
Number of conflicts 594
Objective value 1094546
Best objective bound 1094546
Timing
Total time 0.57 sec
Pre-processing 0.00 sec ( 0.40%)
Pre-binning 0.02 sec ( 4.13%)
Solver 0.55 sec ( 95.19%)
model generation 0.38 sec ( 68.68%)
optimizer 0.17 sec ( 31.32%)
Post-processing 0.00 sec ( 0.15%)
We get a 21x speedup at the cost of a 0.34% reduction in IV.
20 prebins (400 grid elements)
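Both figures follow directly from the totals printed above, as this short arithmetic check shows. The timings are those of this particular run and will differ on other machines.

```python
# Total times (seconds) and total IV copied from the outputs above
# (CP solver, 10 prebins per axis).
time_cp_grid, time_cp_cart = 11.72, 0.57
iv_grid, iv_cart = 1.036652, 1.033139

speedup_cart = time_cp_grid / time_cp_cart
iv_change_pct = 100.0 * (iv_cart - iv_grid) / iv_grid

print(f"CART speedup over grid: {speedup_cart:.1f}x")
print(f"IV change with CART:    {iv_change_pct:.2f}%")
```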
[45]:
optb = OptimalBinning2D(name_x=variable1, name_y=variable2,
solver="cp",
monotonic_trend_x="descending",
monotonic_trend_y="descending",
max_n_prebins_x=20, max_n_prebins_y=20,
min_bin_size=0.05,
special_codes_x=special_codes_x,
special_codes_y=special_codes_y)
optb.fit(x, y, z)
[45]:
OptimalBinning2D(max_n_prebins_x=20, max_n_prebins_y=20, min_bin_size=0.05,
monotonic_trend_x='descending', monotonic_trend_y='descending',
name_x='ExternalRiskEstimate', name_y='AverageMInFile',
special_codes_x=[-9, -8, -7], special_codes_y=[-9, -8, -7])
[46]:
optb.binning_table.build()
[46]:
| | Bin x | Bin y | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 63.50) | (-inf, 54.50) | 749 | 0.071613 | 87 | 662 | 0.883845 | -1.941530 | 0.201662 | 0.021871 |
1 | [63.50, 75.50) | (-inf, 41.50) | 571 | 0.054594 | 141 | 430 | 0.753065 | -1.027198 | 0.051944 | 0.006222 |
2 | [75.50, 87.50) | (-inf, 64.50) | 698 | 0.066737 | 413 | 285 | 0.408309 | 0.458786 | 0.013944 | 0.001728 |
3 | [87.50, inf) | (-inf, inf) | 616 | 0.058897 | 524 | 92 | 0.149351 | 1.827531 | 0.160726 | 0.017692 |
4 | [63.50, 67.50) | [41.50, 74.50) | 550 | 0.052586 | 154 | 396 | 0.720000 | -0.856634 | 0.035757 | 0.004338 |
5 | [67.50, 73.50) | [41.50, 64.50) | 548 | 0.052395 | 194 | 354 | 0.645985 | -0.513611 | 0.013378 | 0.001654 |
6 | [73.50, 75.50) | [41.50, inf) | 552 | 0.052778 | 314 | 238 | 0.431159 | 0.364950 | 0.007008 | 0.000871 |
7 | (-inf, 59.50) | [54.50, inf) | 746 | 0.071326 | 131 | 615 | 0.824397 | -1.458597 | 0.126107 | 0.014500 |
8 | [59.50, 63.50) | [54.50, inf) | 683 | 0.065303 | 176 | 507 | 0.742313 | -0.970199 | 0.055955 | 0.006732 |
9 | [67.50, 70.50) | [64.50, inf) | 599 | 0.057271 | 256 | 343 | 0.572621 | -0.204725 | 0.002381 | 0.000297 |
10 | [70.50, 73.50) | [64.50, inf) | 639 | 0.061096 | 354 | 285 | 0.446009 | 0.304635 | 0.005664 | 0.000705 |
11 | [75.50, 80.50) | [64.50, inf) | 960 | 0.091787 | 665 | 295 | 0.307292 | 0.900639 | 0.071115 | 0.008601 |
12 | [80.50, 84.50) | [64.50, inf) | 814 | 0.077828 | 636 | 178 | 0.218673 | 1.361243 | 0.128764 | 0.014958 |
13 | [84.50, 87.50) | [64.50, inf) | 606 | 0.057941 | 510 | 96 | 0.158416 | 1.757890 | 0.148391 | 0.016478 |
14 | [63.50, 67.50) | [74.50, inf) | 530 | 0.050674 | 178 | 352 | 0.664151 | -0.594020 | 0.017156 | 0.002113 |
15 | Special | Special | 598 | 0.057176 | 267 | 331 | 0.553512 | -0.127042 | 0.000919 | 0.000115 |
16 | Missing | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | | 10459 | 1.000000 | 5000 | 5459 | 0.521943 | | 1.040872 | 0.118874 |
[47]:
optb.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : ExternalRiskEstimate-AverageMInFile
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 168
Number of refinements 2524
Solver statistics
Type cp
Number of booleans 5616
Number of branches 25469
Number of conflicts 13345
Objective value 1102734
Best objective bound 1102734
Timing
Total time 41.69 sec
Pre-processing 0.00 sec ( 0.01%)
Pre-binning 0.02 sec ( 0.04%)
Solver 41.64 sec ( 99.88%)
model generation 16.30 sec ( 39.13%)
optimizer 25.35 sec ( 60.87%)
Post-processing 0.00 sec ( 0.01%)
[48]:
optb = OptimalBinning2D(name_x=variable1, name_y=variable2,
solver="mip",
monotonic_trend_x="descending",
monotonic_trend_y="descending",
max_n_prebins_x=20, max_n_prebins_y=20,
min_bin_size=0.05,
special_codes_x=special_codes_x,
special_codes_y=special_codes_y)
optb.fit(x, y, z)
[48]:
OptimalBinning2D(max_n_prebins_x=20, max_n_prebins_y=20, min_bin_size=0.05,
monotonic_trend_x='descending', monotonic_trend_y='descending',
name_x='ExternalRiskEstimate', name_y='AverageMInFile',
solver='mip', special_codes_x=[-9, -8, -7],
special_codes_y=[-9, -8, -7])
[49]:
optb.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : ExternalRiskEstimate-AverageMInFile
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 168
Number of refinements 2524
Solver statistics
Type mip
Number of variables 4098
Number of constraints 5666
Objective value 1.1027
Best objective bound 1.1027
Timing
Total time 51.33 sec
Pre-processing 0.00 sec ( 0.01%)
Pre-binning 0.02 sec ( 0.04%)
Solver 51.28 sec ( 99.91%)
Post-processing 0.00 sec ( 0.01%)
[50]:
optb = OptimalBinning2D(name_x=variable1, name_y=variable2,
solver="cp", strategy="cart",
monotonic_trend_x="descending",
monotonic_trend_y="descending",
max_n_prebins_x=20, max_n_prebins_y=20,
min_bin_size=0.05,
special_codes_x=special_codes_x,
special_codes_y=special_codes_y)
optb.fit(x, y, z)
[50]:
OptimalBinning2D(max_n_prebins_x=20, max_n_prebins_y=20, min_bin_size=0.05,
monotonic_trend_x='descending', monotonic_trend_y='descending',
name_x='ExternalRiskEstimate', name_y='AverageMInFile',
special_codes_x=[-9, -8, -7], special_codes_y=[-9, -8, -7],
strategy='cart')
[51]:
optb.binning_table.build()
[51]:
| | Bin x | Bin y | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 63.50) | (-inf, 54.50) | 749 | 0.071613 | 87 | 662 | 0.883845 | -1.941530 | 0.201662 | 0.021871 |
1 | [63.50, 67.50) | (-inf, 69.50) | 713 | 0.068171 | 182 | 531 | 0.744741 | -0.982928 | 0.059831 | 0.007192 |
2 | [67.50, 73.50) | (-inf, 64.50) | 811 | 0.077541 | 263 | 548 | 0.675709 | -0.646294 | 0.030883 | 0.003795 |
3 | [73.50, 80.50) | (-inf, 64.50) | 620 | 0.059279 | 324 | 296 | 0.477419 | 0.178212 | 0.001885 | 0.000235 |
4 | [80.50, 84.50) | (-inf, 97.50) | 726 | 0.069414 | 533 | 193 | 0.265840 | 1.103659 | 0.078631 | 0.009359 |
5 | [84.50, inf) | (-inf, 97.50) | 615 | 0.058801 | 511 | 104 | 0.169106 | 1.679806 | 0.139674 | 0.015658 |
6 | (-inf, 59.50) | [54.50, inf) | 746 | 0.071326 | 131 | 615 | 0.824397 | -1.458597 | 0.126107 | 0.014500 |
7 | [59.50, 63.50) | [54.50, inf) | 683 | 0.065303 | 176 | 507 | 0.742313 | -0.970199 | 0.055955 | 0.006732 |
8 | [67.50, 70.50) | [64.50, inf) | 599 | 0.057271 | 256 | 343 | 0.572621 | -0.204725 | 0.002381 | 0.000297 |
9 | [70.50, 75.50) | [64.50, inf) | 1049 | 0.100296 | 596 | 453 | 0.431840 | 0.362176 | 0.013117 | 0.001631 |
10 | [75.50, 80.50) | [64.50, inf) | 960 | 0.091787 | 665 | 295 | 0.307292 | 0.900639 | 0.071115 | 0.008601 |
11 | [63.50, 67.50) | [69.50, inf) | 620 | 0.059279 | 203 | 417 | 0.672581 | -0.632053 | 0.022620 | 0.002781 |
12 | [80.50, inf) | [97.50, inf) | 970 | 0.092743 | 806 | 164 | 0.169072 | 1.680045 | 0.220351 | 0.024702 |
13 | Special | Special | 598 | 0.057176 | 267 | 331 | 0.553512 | -0.127042 | 0.000919 | 0.000115 |
14 | Missing | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | | 10459 | 1.000000 | 5000 | 5459 | 0.521943 | | 1.025133 | 0.117469 |
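The Event rate, WoE and IV columns can be recomputed from the raw counts of each bin. The sketch below is not part of the optbinning API; it assumes the convention WoE = ln(share of non-events / share of events), which is the one that reproduces the values printed for bin 0. The helper name `woe_iv` is hypothetical.

```python
import math

# Totals taken from the "Totals" row of the binning table.
total_non_event = 5000
total_event = 5459


def woe_iv(non_event, event):
    """Return (event_rate, woe, iv) for one bin from its raw counts."""
    p_non_event = non_event / total_non_event  # share of all non-events in the bin
    p_event = event / total_event              # share of all events in the bin
    woe = math.log(p_non_event / p_event)
    iv = (p_non_event - p_event) * woe
    event_rate = event / (non_event + event)
    return event_rate, woe, iv


# Bin 0 of the table: 87 non-events and 662 events.
event_rate, woe, iv = woe_iv(87, 662)
print(f"event rate={event_rate:.6f}  WoE={woe:.6f}  IV={iv:.6f}")
```

The printed values should agree with the first row of the table up to rounding.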
[52]:
optb.binning_table.plot(metric="event_rate")
[53]:
optb.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Name : ExternalRiskEstimate-AverageMInFile
Status : OPTIMAL
Pre-binning statistics
Number of pre-bins 168
Number of refinements 8046
Solver statistics
Type cp
Number of booleans 51
Number of branches 127
Number of conflicts 5
Objective value 1086067
Best objective bound 1086067
Timing
Total time 0.74 sec
Pre-processing 0.00 sec ( 0.53%)
Pre-binning 0.02 sec ( 3.00%)
Solver 0.71 sec ( 96.12%)
model generation 0.69 sec ( 97.08%)
optimizer 0.02 sec ( 2.92%)
Post-processing 0.00 sec ( 0.19%)
We get a 58x speedup in solver time (41.64 s versus 0.71 s) at the cost of a 1.51% reduction in IV (1.0409 versus 1.0251). The following table summarizes the performance improvements:
prebins | CP + grid | MIP + grid | CP + cart | Speedup CP | Speedup MIP |
---|---|---|---|---|---|
10 (100) | 11.40 s | 5.97 s | 0.69 s | 17x | 9x |
20 (400) | 52.76 s | 52.93 s | 0.74 s | 71x | 72x |
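As a quick check of the arithmetic, the speedup and IV figures can be recomputed from the numbers reported in the outputs. This is only a sketch: the values are copied by hand from the `optb.information()` outputs, the binning tables and the summary table, and the variable names are illustrative.

```python
# Figures copied from the optb.information() outputs and the binning tables.
solver_time_grid = 41.64  # seconds, CP solver with the default (grid) strategy
solver_time_cart = 0.71   # seconds, CP solver with strategy="cart"
iv_grid = 1.040872        # total IV with the grid strategy
iv_cart = 1.025133        # total IV with the cart strategy

speedup = solver_time_grid / solver_time_cart
iv_change_pct = 100 * (iv_cart - iv_grid) / iv_grid
print(f"speedup: {speedup:.1f}x, IV change: {iv_change_pct:.2f}%")

# Rows of the summary table: (prebins, CP + grid, MIP + grid, CP + cart) in seconds.
rows = [("10 (100)", 11.40, 5.97, 0.69), ("20 (400)", 52.76, 52.93, 0.74)]
speedups = {
    label: (round(t_cp / t_cart), round(t_mip / t_cart))
    for label, t_cp, t_mip, t_cart in rows
}
print(speedups)
```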
Categorical variables¶
The categorical-categorical and numerical-categorical combinations have been supported since version 0.15.0.
[54]:
df = pd.read_csv("data/kaggle/HomeCreditDefaultRisk/application_train.csv",
engine='c')
Case categorical-categorical
[55]:
variable1 = "ORGANIZATION_TYPE"
variable2 = "NAME_INCOME_TYPE"
x = df[variable1].values
y = df[variable2].values
z = df["TARGET"].values
[56]:
optb = OptimalBinning2D(name_x=variable1, name_y=variable2,
dtype_x="categorical", dtype_y="categorical",
max_n_bins=10)
optb.fit(x, y, z)
[56]:
OptimalBinning2D(dtype_x='categorical', dtype_y='categorical', max_n_bins=10,
name_x='ORGANIZATION_TYPE', name_y='NAME_INCOME_TYPE')
[57]:
optb.binning_table.build(show_bin_xy=True)
[57]:
| | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|
0 | ['Trade: type 4' 'Industry: type 12' 'Transpor... | 55390 | 0.180124 | 52408 | 2982 | 0.053836 | 0.433980 | 0.028327 | 0.003513 |
1 | ['Trade: type 4' 'Industry: type 12' 'Transpor... | 11442 | 0.037208 | 10891 | 551 | 0.048156 | 0.551472 | 0.009006 | 0.001112 |
2 | ['Hotel' 'Industry: type 10' 'Medicine' 'Servi... | 23267 | 0.075662 | 21865 | 1402 | 0.060257 | 0.314502 | 0.006564 | 0.000817 |
3 | ['Housing' 'Industry: type 7' 'Business Entity... | 38863 | 0.126379 | 35966 | 2897 | 0.074544 | 0.086413 | 0.000910 | 0.000114 |
4 | ['Security' 'Industry: type 4' 'Self-employed'... | 19748 | 0.064219 | 17989 | 1759 | 0.089072 | -0.107471 | 0.000776 | 0.000097 |
5 | ['Trade: type 4' 'Industry: type 12' 'Transpor... | 10624 | 0.034548 | 9981 | 643 | 0.060523 | 0.309808 | 0.002914 | 0.000363 |
6 | ['Hotel' 'Industry: type 10' 'Medicine' 'Servi... | 35568 | 0.115664 | 32792 | 2776 | 0.078048 | 0.036688 | 0.000153 | 0.000019 |
7 | ['Housing' 'Industry: type 7' 'Business Entity... | 69102 | 0.224714 | 62197 | 6905 | 0.099925 | -0.234425 | 0.013626 | 0.001699 |
8 | ['Security' 'Industry: type 4' 'Self-employed'... | 31207 | 0.101483 | 27795 | 3412 | 0.109334 | -0.334928 | 0.013102 | 0.001630 |
9 | ['Agriculture' 'Realtor' 'Industry: type 3' 'I... | 12300 | 0.039999 | 10802 | 1498 | 0.121789 | -0.456885 | 0.010111 | 0.001253 |
10 | Special | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
11 | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | 307511 | 1.000000 | 282686 | 24825 | 0.080729 | | 0.085490 | 0.010617 |
[58]:
optb.splits
[58]:
([array(['Trade: type 4', 'Industry: type 12', 'Transport: type 1',
'Trade: type 6', 'Security Ministries', 'University', 'Police',
'Military', 'Bank', 'XNA', 'Culture', 'Insurance', 'Religion',
'School', 'Trade: type 5', 'Hotel', 'Industry: type 10',
'Medicine', 'Services', 'Electricity', 'Industry: type 9',
'Industry: type 5', 'Government', 'Trade: type 2', 'Kindergarten',
'Emergency', 'Industry: type 6', 'Industry: type 2', 'Telecom',
'Other', 'Transport: type 2', 'Legal Services', 'Housing',
'Industry: type 7', 'Business Entity Type 1', 'Advertising',
'Postal', 'Business Entity Type 2', 'Industry: type 11',
'Trade: type 1', 'Mobile', 'Transport: type 4',
'Business Entity Type 3', 'Trade: type 7', 'Security',
'Industry: type 4', 'Self-employed', 'Trade: type 3',
'Agriculture', 'Realtor', 'Industry: type 3', 'Industry: type 1',
'Cleaning', 'Construction', 'Restaurant', 'Industry: type 8',
'Industry: type 13', 'Transport: type 3'], dtype=object),
array(['Trade: type 4', 'Industry: type 12', 'Transport: type 1',
'Trade: type 6', 'Security Ministries', 'University', 'Police',
'Military', 'Bank', 'XNA', 'Culture', 'Insurance', 'Religion',
'School', 'Trade: type 5'], dtype=object),
array(['Hotel', 'Industry: type 10', 'Medicine', 'Services',
'Electricity', 'Industry: type 9', 'Industry: type 5',
'Government', 'Trade: type 2', 'Kindergarten', 'Emergency',
'Industry: type 6', 'Industry: type 2', 'Telecom', 'Other',
'Transport: type 2', 'Legal Services'], dtype=object),
array(['Housing', 'Industry: type 7', 'Business Entity Type 1',
'Advertising', 'Postal', 'Business Entity Type 2',
'Industry: type 11', 'Trade: type 1', 'Mobile',
'Transport: type 4', 'Business Entity Type 3', 'Trade: type 7'],
dtype=object),
array(['Security', 'Industry: type 4', 'Self-employed', 'Trade: type 3',
'Agriculture', 'Realtor', 'Industry: type 3', 'Industry: type 1',
'Cleaning', 'Construction', 'Restaurant', 'Industry: type 8',
'Industry: type 13', 'Transport: type 3'], dtype=object),
array(['Trade: type 4', 'Industry: type 12', 'Transport: type 1',
'Trade: type 6', 'Security Ministries', 'University', 'Police',
'Military', 'Bank', 'XNA', 'Culture', 'Insurance', 'Religion',
'School', 'Trade: type 5'], dtype=object),
array(['Hotel', 'Industry: type 10', 'Medicine', 'Services',
'Electricity', 'Industry: type 9', 'Industry: type 5',
'Government', 'Trade: type 2', 'Kindergarten', 'Emergency',
'Industry: type 6', 'Industry: type 2', 'Telecom', 'Other',
'Transport: type 2', 'Legal Services'], dtype=object),
array(['Housing', 'Industry: type 7', 'Business Entity Type 1',
'Advertising', 'Postal', 'Business Entity Type 2',
'Industry: type 11', 'Trade: type 1', 'Mobile',
'Transport: type 4', 'Business Entity Type 3', 'Trade: type 7'],
dtype=object),
array(['Security', 'Industry: type 4', 'Self-employed', 'Trade: type 3'],
dtype=object),
array(['Agriculture', 'Realtor', 'Industry: type 3', 'Industry: type 1',
'Cleaning', 'Construction', 'Restaurant', 'Industry: type 8',
'Industry: type 13', 'Transport: type 3'], dtype=object)],
[array(['Businessman', 'Student', 'Pensioner'], dtype=object),
array(['State servant', 'Commercial associate'], dtype=object),
array(['State servant', 'Commercial associate'], dtype=object),
array(['State servant', 'Commercial associate'], dtype=object),
array(['State servant', 'Commercial associate'], dtype=object),
array(['Working', 'Unemployed', 'Maternity leave'], dtype=object),
array(['Working', 'Unemployed', 'Maternity leave'], dtype=object),
array(['Working', 'Unemployed', 'Maternity leave'], dtype=object),
array(['Working', 'Unemployed', 'Maternity leave'], dtype=object),
array(['Working', 'Unemployed', 'Maternity leave'], dtype=object)])
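For categorical variables, the i-th entry of each list in `optb.splits` holds the categories of bin i along that axis, so a pair (x, y) belongs to bin i when x is in the i-th array of the first list and y is in the i-th array of the second. The sketch below illustrates that reading with abbreviated copies of three of the bins printed above; `find_bin` is a hypothetical helper, not an optbinning function, and in practice `optb.transform(x, y, metric="indices")` does this lookup.

```python
# Abbreviated copies of bins 1, 2 and 6 from optb.splits: each bin pairs a set
# of x categories with a set of y categories. Only a few x categories per bin
# are kept here to keep the example short.
bins = {
    1: ({"Bank", "Police", "School"},
        {"State servant", "Commercial associate"}),
    2: ({"Hotel", "Medicine", "Government"},
        {"State servant", "Commercial associate"}),
    6: ({"Hotel", "Medicine", "Government"},
        {"Working", "Unemployed", "Maternity leave"}),
}


def find_bin(x_value, y_value):
    """Return the index of the bin containing the pair, or None if no bin matches."""
    for index, (cats_x, cats_y) in bins.items():
        if x_value in cats_x and y_value in cats_y:
            return index
    return None


print(find_bin("Bank", "State servant"))
print(find_bin("Hotel", "Working"))
```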
[59]:
optb.binning_table.plot(metric="event_rate")
[60]:
z_transform_woe = optb.transform(x, y, metric="woe")
pd.Series(z_transform_woe).value_counts()
[60]:
-0.234425 69102
0.433980 55390
0.086413 38863
0.036688 35568
-0.334928 31207
0.314502 23267
-0.107471 19748
-0.456885 12300
0.551472 11442
0.309808 10624
dtype: int64
[61]:
z_transform_indices = optb.transform(x, y, metric="indices")
pd.Series(z_transform_indices).value_counts()
[61]:
7 69102
0 55390
3 38863
6 35568
8 31207
2 23267
4 19748
9 12300
1 11442
5 10624
dtype: int64
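The two transforms describe the same assignment: `metric="indices"` returns the bin index of each record and `metric="woe"` returns the WoE of that bin. A minimal sketch of the correspondence, using values copied by hand from the binning table and from the two `value_counts()` outputs above (the dictionary names are illustrative):

```python
# WoE per bin index, copied from the binning table.
woe_by_bin = {
    0: 0.433980, 1: 0.551472, 2: 0.314502, 3: 0.086413, 4: -0.107471,
    5: 0.309808, 6: 0.036688, 7: -0.234425, 8: -0.334928, 9: -0.456885,
}

# Record counts per bin index, copied from the metric="indices" output.
count_by_index = {
    7: 69102, 0: 55390, 3: 38863, 6: 35568, 8: 31207,
    2: 23267, 4: 19748, 9: 12300, 1: 11442, 5: 10624,
}

# Record counts per WoE value, copied from the metric="woe" output.
count_by_woe = {
    -0.234425: 69102, 0.433980: 55390, 0.086413: 38863, 0.036688: 35568,
    -0.334928: 31207, 0.314502: 23267, -0.107471: 19748, -0.456885: 12300,
    0.551472: 11442, 0.309808: 10624,
}

# Mapping every index to its WoE should reproduce the WoE counts exactly.
mapped = {woe_by_bin[index]: count for index, count in count_by_index.items()}
print(mapped == count_by_woe)
print(sum(count_by_index.values()))
```

The total should equal the 307511 records reported in the Totals row of the table.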
Case numerical-categorical
[62]:
variable1 = "AMT_INCOME_TOTAL"
variable2 = "NAME_INCOME_TYPE"
x = df[variable1].values
y = df[variable2].values
[63]:
optb = OptimalBinning2D(name_x=variable1, name_y=variable2,
dtype_x="numerical", dtype_y="categorical",
monotonic_trend_x="descending",
monotonic_trend_y="ascending")
optb.fit(x, y, z)
[63]:
OptimalBinning2D(dtype_y='categorical', monotonic_trend_x='descending',
monotonic_trend_y='ascending', name_x='AMT_INCOME_TOTAL',
name_y='NAME_INCOME_TYPE')
[64]:
optb.binning_table.build(show_bin_xy=True)
[64]:
| | Bin | Count | Count (%) | Non-event | Event | Event rate | WoE | IV | JS |
---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 184511.25) $\cup$ ['Businessman' 'Stude... | 45526 | 0.148047 | 43045 | 2481 | 0.054496 | 0.421099 | 0.022037 | 0.002734 |
1 | [184511.25, 232717.50) $\cup$ ['Businessman' '... | 5400 | 0.017560 | 5121 | 279 | 0.051667 | 0.477408 | 0.003283 | 0.000407 |
2 | [232717.50, 310950.00) $\cup$ ['Businessman' '... | 4955 | 0.016113 | 4723 | 232 | 0.046821 | 0.580977 | 0.004277 | 0.000527 |
3 | [310950.00, inf) $\cup$ ['Businessman' 'Studen... | 3723 | 0.012107 | 3570 | 153 | 0.041096 | 0.717397 | 0.004638 | 0.000568 |
4 | (-inf, 76477.50) $\cup$ ['State servant'] | 1299 | 0.004224 | 1206 | 93 | 0.071594 | 0.129979 | 0.000068 | 0.000008 |
5 | [76477.50, 184511.25) $\cup$ ['State servant'] | 12623 | 0.041049 | 11820 | 803 | 0.063614 | 0.256708 | 0.002430 | 0.000303 |
6 | [184511.25, 232717.50) $\cup$ ['State servant'] | 3567 | 0.011600 | 3377 | 190 | 0.053266 | 0.445233 | 0.001911 | 0.000237 |
7 | (-inf, 76477.50) $\cup$ ['Commercial associate'] | 1917 | 0.006234 | 1734 | 183 | 0.095462 | -0.183786 | 0.000227 | 0.000028 |
8 | [76477.50, 184511.25) $\cup$ ['Commercial asso... | 39005 | 0.126841 | 35809 | 3196 | 0.081938 | -0.016186 | 0.000033 | 0.000004 |
9 | [184511.25, 232717.50) $\cup$ ['Commercial ass... | 12996 | 0.042262 | 12079 | 917 | 0.070560 | 0.145631 | 0.000843 | 0.000105 |
10 | [232717.50, 310950.00) $\cup$ ['Commercial ass... | 8090 | 0.026308 | 7558 | 532 | 0.065760 | 0.221233 | 0.001174 | 0.000146 |
11 | [310950.00, inf) $\cup$ ['Commercial associate'] | 9609 | 0.031248 | 9077 | 532 | 0.055365 | 0.404370 | 0.004319 | 0.000536 |
12 | (-inf, 76477.50) $\cup$ ['Working' 'Unemployed... | 10879 | 0.035378 | 9786 | 1093 | 0.100469 | -0.240459 | 0.002263 | 0.000282 |
13 | [76477.50, 184511.25) $\cup$ ['Working' 'Unemp... | 104920 | 0.341191 | 94465 | 10455 | 0.099647 | -0.231336 | 0.020121 | 0.002510 |
14 | [184511.25, 232717.50) $\cup$ ['Working' 'Unem... | 22580 | 0.073428 | 20492 | 2088 | 0.092471 | -0.148658 | 0.001727 | 0.000216 |
15 | [232717.50, 310950.00) $\cup$ ['Working' 'Unem... | 11666 | 0.037937 | 10688 | 978 | 0.083833 | -0.041118 | 0.000065 | 0.000008 |
16 | [310950.00, inf) $\cup$ ['Working' 'Unemployed... | 8756 | 0.028474 | 8136 | 620 | 0.070809 | 0.141849 | 0.000540 | 0.000067 |
17 | Special | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
18 | Missing | 0 | 0.000000 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
Totals | | 307511 | 1.000000 | 282686 | 24825 | 0.080729 | | 0.069958 | 0.008688 |
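The effect of `monotonic_trend_x="descending"` and `monotonic_trend_y="ascending"` can be read off the Event rate column: within each group of categories the event rate falls as income rises, and for a fixed income bin it rises from one category group to the next. The sketch below checks this on event rates copied by hand from the table; the grouping by row numbers and the helper names are illustrative assumptions, not optbinning output.

```python
# Event rates copied from the table, grouped by y bin and ordered by increasing x bin.
event_rates_by_y_bin = {
    "rows 0-3": [0.054496, 0.051667, 0.046821, 0.041096],
    "rows 4-6 (State servant)": [0.071594, 0.063614, 0.053266],
    "rows 7-11 (Commercial associate)": [0.095462, 0.081938, 0.070560,
                                          0.065760, 0.055365],
    "rows 12-16": [0.100469, 0.099647, 0.092471, 0.083833, 0.070809],
}

# Event rates of rows 4, 7 and 12, which share the x bin (-inf, 76477.50).
event_rates_lowest_x_bin = [0.071594, 0.095462, 0.100469]


def is_descending(values):
    """True if every value is greater than or equal to the next one."""
    return all(a >= b for a, b in zip(values, values[1:]))


def is_ascending(values):
    """True if every value is less than or equal to the next one."""
    return all(a <= b for a, b in zip(values, values[1:]))


descending_in_x = {name: is_descending(rates)
                   for name, rates in event_rates_by_y_bin.items()}
print(descending_in_x)
print(is_ascending(event_rates_lowest_x_bin))
```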
[65]:
optb.binning_table.plot(metric="event_rate")
[66]:
z_transform_woe = optb.transform(x, y, metric="woe")
pd.Series(z_transform_woe).value_counts()
[66]:
-0.231336 104920
0.421099 45526
-0.016186 39005
-0.148658 22580
0.145631 12996
0.256708 12623
-0.041118 11666
-0.240459 10879
0.404370 9609
0.141849 8756
0.221233 8090
0.477408 5400
0.580977 4955
0.717397 3723
0.445233 3567
-0.183786 1917
0.129979 1299
dtype: int64
[67]:
z_transform_indices = optb.transform(x, y, metric="indices")
pd.Series(z_transform_indices).value_counts()
[67]:
13 104920
0 45526
8 39005
14 22580
9 12996
5 12623
15 11666
12 10879
11 9609
16 8756
10 8090
1 5400
2 4955
3 3723
6 3567
7 1917
4 1299
dtype: int64