Tutorial: optimal piecewise binning with continuous target

Basic

To get us started, let’s load a well-known dataset from the UCI repository and transform the data into a pandas.DataFrame.

[1]:
import pandas as pd
from tests.datasets import load_boston
[2]:
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)

We choose a variable to discretize and the continuous target.

[3]:
variable = "LSTAT"
x = df[variable].values
y = data.target

Import and instantiate a ContinuousOptimalPWBinning object class, passing the variable name. The ContinuousOptimalPWBinning can ONLY handle numerical variables; this differs from the ContinuousOptimalBinning object class.

[4]:
from optbinning import ContinuousOptimalPWBinning
[5]:
optb = ContinuousOptimalPWBinning(name=variable)

We fit the optimal binning object with arrays x and y.

[6]:
optb.fit(x, y)
[6]:
ContinuousOptimalPWBinning(name='LSTAT')

You can check if an optimal solution has been found via the status attribute:

[7]:
optb.status
[7]:
'OPTIMAL'

You can also retrieve the optimal split points via the splits attribute:

[8]:
optb.splits
[8]:
array([ 4.6500001 ,  5.49499989,  6.86500001,  9.7249999 , 13.0999999 ,
       14.4000001 , 17.23999977, 19.89999962, 23.31500053])
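
These split points can also be used outside the library, for instance to see how many records fall into each interval. A minimal sketch with NumPy (the binning table below reports these counts properly):

import numpy as np

# Assign each record to an interval using the optimal split points as edges;
# index 0 corresponds to (-inf, 4.65), index 1 to [4.65, 5.49), and so on.
bin_indices = np.digitize(x, optb.splits)
print(np.bincount(bin_indices))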

The binning table

The optimal binning algorithms return a binning table displaying the binned data and several metrics for each bin. The class ContinuousOptimalPWBinning returns an object PWContinuousBinningTable via the binning_table attribute.

[9]:
binning_table = optb.binning_table
[10]:
type(binning_table)
[10]:
optbinning.binning.piecewise.binning_statistics.PWContinuousBinningTable

The binning_table is instantiated, but not built. Therefore, the first step is to call the method build, which returns a pandas.DataFrame.

[11]:
binning_table.build()
[11]:
Bin Count Count (%) Sum Std Min Max Zeros count c0 c1
0 (-inf, 4.65) 50.0 0.098814 1985.9 8.198651 22.8 50.0 0 56.744157 -4.821782
1 [4.65, 5.49) 28.0 0.055336 853.2 6.123541 21.9 50.0 0 74.661294 -8.674929
2 [5.49, 6.87) 45.0 0.088933 1188.6 5.136259 20.6 48.8 0 26.992559 -0.000000
3 [6.87, 9.72) 89.0 0.175889 2274.9 6.845250 11.9 50.0 0 36.396429 -1.369828
4 [9.72, 13.10) 84.0 0.166008 1755.4 2.949979 14.5 31.0 0 31.699081 -0.886810
5 [13.10, 14.40) 32.0 0.063241 667.4 2.632482 15.0 29.6 0 27.030856 -0.530457
6 [14.40, 17.24) 60.0 0.118577 1037.5 3.588003 10.2 30.7 0 35.230331 -1.099865
7 [17.24, 19.90) 43.0 0.084980 714.3 4.032554 8.3 27.5 0 23.261993 -0.405646
8 [19.90, 23.32) 28.0 0.055336 368.4 3.912839 5.0 21.7 0 33.251552 -0.907634
9 [23.32, inf) 47.0 0.092885 556.0 4.006586 5.0 23.7 0 13.222398 -0.048567
10 Special 0.0 0.000000 0.0 None None None 0 0.000000 0.000000
11 Missing 0.0 0.000000 0.0 None None None 0 0.000000 0.000000
Totals 506.0 1.000000 11401.6 5.0 50.0 0 - -

Let’s describe the columns of this binning table:

  • Bin: the intervals delimited by the optimal split points.

  • Count: the number of records for each bin.

  • Count (%): the percentage of records for each bin.

  • Sum: the target sum for each bin.

  • Std: the target std for each bin.

  • Min: the target min value for each bin.

  • Max: the target max value for each bin.

  • Zeros count: the number of zeros for each bin.

  • \(c_0\): the first coefficient of the prediction polynomial.

  • \(c_1\): the second coefficient of the prediction polynomial.

The prediction for bin \(i\) is defined as \(P_i = c_0 + c_1 x_i\), where \(x_i \in \text{Bin}_{i}\). In general, \begin{equation} P_i = \sum_{j=0}^d c_j x_i^j, \end{equation} where \(d\) is the degree of the prediction polynomial.
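
For instance, taking the bin 3 coefficients reported in the table above, the prediction for a record with LSTAT = 8 (which falls in [6.87, 9.72)) can be evaluated by hand:

# Illustration only: evaluate P_3 = c0 + c1 * x with the bin 3 coefficients.
c0, c1 = 36.396429, -1.369828
print(c0 + c1 * 8.0)  # approximately 25.44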

The last row shows the totals: the overall number of records, the target sum, and the global minimum and maximum.

You can use the method plot to visualize the histogram and mean curve.

[12]:
binning_table.plot()
../_images/tutorials_tutorial_piecewise_continuous_24_0.png

Mean transformation

Now that we have checked the binned data, we can transform our original data into mean values.

[13]:
x_transform_mean = optb.transform(x)
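
The transformed values follow the piecewise prediction polynomial of each bin rather than a constant per-bin mean. As a quick sanity check (illustrative only), the mean absolute error between these predictions and the target should be close to the figure reported by the analysis method in the next section:

import numpy as np

# Mean absolute error of the piecewise predictions against the target.
print(np.mean(np.abs(x_transform_mean - y)))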

Advanced

Many of the advanced options have already been covered in the previous tutorials with a binary target. Check them out! In this section, we focus on the mean monotonicity trend and the mean difference between bins.

Binning table statistical analysis

The analysis method performs a statistical analysis of the binning table, computing the Herfindahl-Hirschman Index (HHI) and several performance metrics for regression problems. Additionally, statistical significance tests between consecutive bins are performed using the Student's t-test.

[14]:
binning_table.analysis()
-------------------------------------------------
OptimalBinning: Continuous Binning Table Analysis
-------------------------------------------------

  General metrics

    Mean absolute error              3.67892091
    Mean squared error              26.06425104
    Median absolute error            2.71794576
    Explained variance               0.69125340
    R^2                              0.69125340
    MPE                             -0.05221127
    MAPE                             0.17785599
    SMAPE                            0.08395478
    MdAPE                            0.13052555
    SMdAPE                           0.06542912
    HHI                              0.11620241
    HHI (normalized)                 0.03585717
    Quality score                    0.01671264

  Significance tests

    Bin A  Bin B  t-statistic      p-value
        0      1     5.644492 3.313748e-07
        1      2     2.924528 5.175586e-03
        2      3     0.808313 4.206096e-01
        3      4     5.874488 3.816654e-08
        4      5     0.073112 9.419504e-01
        5      6     5.428848 5.770714e-07
        6      7     0.883289 3.796030e-01
        7      8     3.591859 6.692488e-04
        8      9     1.408305 1.643801e-01
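
As a rough illustration (a sketch, not the library's internal code; the exact t-test variant may differ), one of these tests can be reproduced by comparing the target values of the records in two consecutive bins:

from scipy import stats

# Target values of bin 0, (-inf, 4.65), versus bin 1, [4.65, 5.49).
mask_0 = x < optb.splits[0]
mask_1 = (x >= optb.splits[0]) & (x < optb.splits[1])
print(stats.ttest_ind(y[mask_0], y[mask_1], equal_var=False))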

Mean monotonicity

The monotonic_trend option permits forcing a monotonic trend on the mean curve. The default setting “auto” should be the preferred option; however, some business constraints might require imposing a different trend. With “auto”, the monotonic trend most likely to minimize the L1-norm is chosen from the options “ascending”, “descending”, “peak” and “valley” using a machine-learning-based classifier.
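
A quick way to compare several trends is to fit one binning object per option and inspect the results. A minimal sketch reusing the LSTAT arrays x and y from the Basic section (option names taken from the text above and the convex example below; see the API reference for the full list of supported values):

# Sketch: fit one binning per trend and inspect status and number of splits.
for trend in ("auto", "descending", "convex"):
    optb_trend = ContinuousOptimalPWBinning(name="LSTAT", monotonic_trend=trend)
    optb_trend.fit(x, y)
    print(trend, optb_trend.status, optb_trend.splits.size)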

[16]:
variable = "INDUS"
x = df[variable].values
y = data.target
[17]:
optb = ContinuousOptimalPWBinning(name=variable, monotonic_trend="auto")
optb.fit(x, y)
[17]:
ContinuousOptimalPWBinning(name='INDUS')
[18]:
binning_table = optb.binning_table
binning_table.build()
[18]:
Bin Count Count (%) Sum Std Min Max Zeros count c0 c1
0 (-inf, 3.35) 63.0 0.124506 1994.0 8.569841 16.5 50.0 0 31.494243 0.000000
1 [3.35, 5.04) 57.0 0.112648 1615.2 8.072710 17.2 50.0 0 44.441955 -3.864989
2 [5.04, 6.66) 66.0 0.130435 1723.7 7.879078 16.0 50.0 0 24.962412 0.000000
3 [6.66, 9.12) 64.0 0.126482 1292.0 4.614126 12.7 35.2 0 39.414039 -2.169914
4 [9.12, 10.30) 29.0 0.057312 584.1 2.252281 16.1 24.5 0 19.613573 0.000000
5 [10.30, 20.73) 200.0 0.395257 3736.2 8.959305 5.0 50.0 0 21.429532 -0.176307
6 [20.73, inf) 27.0 0.053360 456.4 3.690878 7.0 23.0 0 23.933551 -0.297070
7 Special 0.0 0.000000 0.0 None None None 0 0.000000 0.000000
8 Missing 0.0 0.000000 0.0 None None None 0 0.000000 0.000000
Totals 506.0 1.000000 11401.6 5.0 50.0 0 - -
[19]:
binning_table.plot()
../_images/tutorials_tutorial_piecewise_continuous_39_0.png
[20]:
binning_table.analysis()
-------------------------------------------------
OptimalBinning: Continuous Binning Table Analysis
-------------------------------------------------

  General metrics

    Mean absolute error              5.44142744
    Mean squared error              59.78870565
    Median absolute error            4.13838129
    Explained variance               0.29176712
    R^2                              0.29176712
    MPE                             -0.12274348
    MAPE                             0.28239036
    SMAPE                            0.12302499
    MdAPE                            0.19474014
    SMdAPE                           0.09639011
    HHI                              0.22356231
    HHI (normalized)                 0.12650760
    Quality score                    0.03355055

  Significance tests

    Bin A  Bin B  t-statistic      p-value
        0      1     2.180865 3.118080e-02
        1      2     1.537968 1.267445e-01
        2      3     5.254539 7.781110e-07
        3      4     0.064736 9.485275e-01
        4      5     1.923770 5.601023e-02
        5      6     1.867339 6.563949e-02

A smoother, convex curve, preserving the overall descending trend, can be achieved by using monotonic_trend="convex".

[21]:
optb = ContinuousOptimalPWBinning(name=variable, monotonic_trend="convex")
optb.fit(x, y)
[21]:
ContinuousOptimalPWBinning(monotonic_trend='convex', name='INDUS')
[22]:
binning_table = optb.binning_table
binning_table.build()
[22]:
Bin Count Count (%) Sum Std Min Max Zeros count c0 c1
0 (-inf, 3.35) 63.0 0.124506 1994.0 8.569841 16.5 50.0 0 35.353511 -1.753581
1 [3.35, 5.04) 57.0 0.112648 1615.2 8.072710 17.2 50.0 0 35.353511 -1.753581
2 [5.04, 6.66) 66.0 0.130435 1723.7 7.879078 16.0 50.0 0 34.324955 -1.549503
3 [6.66, 9.12) 64.0 0.126482 1292.0 4.614126 12.7 35.2 0 34.324955 -1.549503
4 [9.12, 10.30) 29.0 0.057312 584.1 2.252281 16.1 24.5 0 22.168554 -0.217294
5 [10.30, 20.73) 200.0 0.395257 3736.2 8.959305 5.0 50.0 0 22.168554 -0.217294
6 [20.73, inf) 27.0 0.053360 456.4 3.690878 7.0 23.0 0 22.168554 -0.217294
7 Special 0.0 0.000000 0.0 None None None 0 0.000000 0.000000
8 Missing 0.0 0.000000 0.0 None None None 0 0.000000 0.000000
Totals 506.0 1.000000 11401.6 5.0 50.0 0 - -
[23]:
binning_table.plot()
../_images/tutorials_tutorial_piecewise_continuous_44_0.png
[24]:
binning_table.analysis()
-------------------------------------------------
OptimalBinning: Continuous Binning Table Analysis
-------------------------------------------------

  General metrics

    Mean absolute error              5.51198416
    Mean squared error              60.76046517
    Median absolute error            4.19400512
    Explained variance               0.28025605
    R^2                              0.28025605
    MPE                             -0.12411637
    MAPE                             0.28485622
    SMAPE                            0.12424799
    MdAPE                            0.19529048
    SMdAPE                           0.09815566
    HHI                              0.22356231
    HHI (normalized)                 0.12650760
    Quality score                    0.03355055

  Significance tests

    Bin A  Bin B  t-statistic      p-value
        0      1     2.180865 3.118080e-02
        1      2     1.537968 1.267445e-01
        2      3     5.254539 7.781110e-07
        3      4     0.064736 9.485275e-01
        4      5     1.923770 5.601023e-02
        5      6     1.867339 6.563949e-02
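
To put the two trends side by side, the R^2 figures reported above can be recomputed from the piecewise predictions; a minimal sketch with scikit-learn (the numbers should roughly match the analysis reports):

from sklearn.metrics import r2_score

# Refit both trends on INDUS and compare R^2 of the piecewise predictions.
for trend in ("auto", "convex"):
    optb_trend = ContinuousOptimalPWBinning(name=variable, monotonic_trend=trend)
    optb_trend.fit(x, y)
    print(trend, r2_score(y, optb_trend.transform(x)))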