Tutorial: optimal piecewise binning with continuous target¶

Basic¶

To get us started, let’s load a well-known dataset from the UCI repository and transform the data into a pandas.DataFrame.

[1]:

import pandas as pd
from tests.datasets import load_boston

[2]:

data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)

We choose a variable to discretize and the continuous target.

[3]:

variable = "LSTAT"
x = df[variable].values
y = data.target

Import and instantiate an ContinuousOptimalPWBinning object class and we pass the variable name. The ContinuousOptimalPWBinning can ONLY handle numerical variables. This differs from the ContinuousOptimalBinning object class.

[4]:

from optbinning import ContinuousOptimalPWBinning

[5]:

optb = ContinuousOptimalPWBinning(name=variable)

We fit the optimal binning object with arrays x and y.

[6]:

optb.fit(x, y)

[6]:

ContinuousOptimalPWBinning(name='LSTAT')

You can check if an optimal solution has been found via the status attribute:

[7]:

optb.status

[7]:

'OPTIMAL'

You can also retrieve the optimal split points via the splits attribute:

[8]:

optb.splits

[8]:

array([ 4.6500001 ,  5.49499989,  6.86500001,  9.7249999 , 13.0999999 ,
       14.4000001 , 17.23999977, 19.89999962, 23.31500053])

The binning table¶

The optimal binning algorithms return a binning table; a binning table displays the binned data and several metrics for each bin. Class ContinuousOptimalPWBinning returns an object PWContinuousBinningTable via the binning_table attribute.

[9]:

binning_table = optb.binning_table

[10]:

type(binning_table)

[10]:

optbinning.binning.piecewise.binning_statistics.PWContinuousBinningTable

The binning_table is instantiated, but not built. Therefore, the first step is to call the method build, which returns a pandas.DataFrame.

[11]:

binning_table.build()

[11]:

	Bin	Count	Count (%)	Sum	Std	Min	Max	c0	c1
0	(-inf, 4.65)	50.0	0.098814	1985.9	8.198651	22.8	50.0	56.744157	-4.821782
1	[4.65, 5.49)	28.0	0.055336	853.2	6.123541	21.9	50.0	74.661294	-8.674929
2	[5.49, 6.87)	45.0	0.088933	1188.6	5.136259	20.6	48.8	26.992559	-0.000000
3	[6.87, 9.72)	89.0	0.175889	2274.9	6.845250	11.9	50.0	36.396429	-1.369828
4	[9.72, 13.10)	84.0	0.166008	1755.4	2.949979	14.5	31.0	31.699081	-0.886810
5	[13.10, 14.40)	32.0	0.063241	667.4	2.632482	15.0	29.6	27.030856	-0.530457
6	[14.40, 17.24)	60.0	0.118577	1037.5	3.588003	10.2	30.7	35.230331	-1.099865
7	[17.24, 19.90)	43.0	0.084980	714.3	4.032554	8.3	27.5	23.261993	-0.405646
8	[19.90, 23.32)	28.0	0.055336	368.4	3.912839	5.0	21.7	33.251552	-0.907634
9	[23.32, inf)	47.0	0.092885	556.0	4.006586	5.0	23.7	13.222398	-0.048567
10	Special	0.0	0.000000	0.0	None	None	None	0.000000	0.000000
11	Missing	0.0	0.000000	0.0	None	None	None	0.000000	0.000000
Totals		506.0	1.000000	11401.6		5.0	50.0	-	-

Let’s describe the columns of this binning table:

Bin: the intervals delimited by the optimal split points.
Count: the number of records for each bin.
Count (%): the percentage of records for each bin.
Sum: the target sum for each bin.
Std: the target std for each bin.
Min: the target min value for each bin.
Max: the target max value for each bin.
Zeros count: the number of zeros for each bin.
\(c_0\): the first coefficient of the prediction polynomial.
\(c_1\): the second coefficient of the prediction polynomial.

The for bin \(i\) is defined as \(P_i = c_0 + c_1 x_i\), where \(x_i \in \text{Bin}_{i}\). In general, \begin{equation} P_i = \sum_{j=0}^d c_j x_i^j, \end{equation} where \(d\) is the degree of the event rate polynomial.

The last row shows the total number of records, sum and mean.

You can use the method plot to visualize the histogram and mean curve.

[12]:

binning_table.plot()

../_images/tutorials_tutorial_piecewise_continuous_24_0.png

Mean transformation¶

Now that we have checked the binned data, we can transform our original data into mean values.

[13]:

x_transform_mean = optb.transform(x)

Advanced¶

Many of the advanced options have been covered in the previous tutorials with a binary target. Check it out! In this section, we focus on the mean monotonicity trend and the mean difference between bins.

Binning table statistical analysis¶

The analysis method performs a statistical analysis of the binning table, computing the Information Value (IV) and Herfindahl-Hirschman Index (HHI). Additionally, several statistical significance tests between consecutive bins of the contingency table are performed using the Student’s t-test. The piecewise binning also includes several performance metrics for regression problems.

[14]:

binning_table.analysis()

-------------------------------------------------
OptimalBinning: Continuous Binning Table Analysis
-------------------------------------------------

  General metrics

    Mean absolute error              3.67892091
    Mean squared error              26.06425104
    Median absolute error            2.71794576
    Explained variance               0.69125340
    R^2                              0.69125340
    MPE                             -0.05221127
    MAPE                             0.17785599
    SMAPE                            0.08395478
    MdAPE                            0.13052555
    SMdAPE                           0.06542912
    HHI                              0.11620241
    HHI (normalized)                 0.03585717
    Quality score                    0.01671264

  Significance tests

    Bin A  Bin B  t-statistic      p-value
        0      1     5.644492 3.313748e-07
        1      2     2.924528 5.175586e-03
        2      3     0.808313 4.206096e-01
        3      4     5.874488 3.816654e-08
        4      5     0.073112 9.419504e-01
        5      6     5.428848 5.770714e-07
        6      7     0.883289 3.796030e-01
        7      8     3.591859 6.692488e-04
        8      9     1.408305 1.643801e-01

[15]:

binning_table.analysis()

-------------------------------------------------
OptimalBinning: Continuous Binning Table Analysis
-------------------------------------------------

  General metrics

    Mean absolute error              3.67892091
    Mean squared error              26.06425104
    Median absolute error            2.71794576
    Explained variance               0.69125340
    R^2                              0.69125340
    MPE                             -0.05221127
    MAPE                             0.17785599
    SMAPE                            0.08395478
    MdAPE                            0.13052555
    SMdAPE                           0.06542912
    HHI                              0.11620241
    HHI (normalized)                 0.03585717
    Quality score                    0.01671264

  Significance tests

    Bin A  Bin B  t-statistic      p-value
        0      1     5.644492 3.313748e-07
        1      2     2.924528 5.175586e-03
        2      3     0.808313 4.206096e-01
        3      4     5.874488 3.816654e-08
        4      5     0.073112 9.419504e-01
        5      6     5.428848 5.770714e-07
        6      7     0.883289 3.796030e-01
        7      8     3.591859 6.692488e-04
        8      9     1.408305 1.643801e-01

Mean monotonicity¶

The monotonic_trend option permits forcing a monotonic trend to the mean curve. The default setting “auto” should be the preferred option, however, some business constraints might require to impose different trends. The default setting “auto” chooses the monotonic trend most likely to minimize the L1-norm from the options “ascending”, “descending”, “peak” and “valley” using a machine-learning-based classifier.

[16]:

variable = "INDUS"
x = df[variable].values
y = data.target

[17]:

optb = ContinuousOptimalPWBinning(name=variable, monotonic_trend="auto")
optb.fit(x, y)

[17]:

ContinuousOptimalPWBinning(name='INDUS')

[18]:

binning_table = optb.binning_table
binning_table.build()

[18]:

	Bin	Count	Count (%)	Sum	Std	Min	Max	c0	c1
0	(-inf, 3.35)	63.0	0.124506	1994.0	8.569841	16.5	50.0	31.494243	0.000000
1	[3.35, 5.04)	57.0	0.112648	1615.2	8.072710	17.2	50.0	44.441955	-3.864989
2	[5.04, 6.66)	66.0	0.130435	1723.7	7.879078	16.0	50.0	24.962412	0.000000
3	[6.66, 9.12)	64.0	0.126482	1292.0	4.614126	12.7	35.2	39.414039	-2.169914
4	[9.12, 10.30)	29.0	0.057312	584.1	2.252281	16.1	24.5	19.613573	0.000000
5	[10.30, 20.73)	200.0	0.395257	3736.2	8.959305	5.0	50.0	21.429532	-0.176307
6	[20.73, inf)	27.0	0.053360	456.4	3.690878	7.0	23.0	23.933551	-0.297070
7	Special	0.0	0.000000	0.0	None	None	None	0.000000	0.000000
8	Missing	0.0	0.000000	0.0	None	None	None	0.000000	0.000000
Totals		506.0	1.000000	11401.6		5.0	50.0	-	-

[19]:

binning_table.plot()

../_images/tutorials_tutorial_piecewise_continuous_39_0.png

[20]:

binning_table.analysis()

-------------------------------------------------
OptimalBinning: Continuous Binning Table Analysis
-------------------------------------------------

  General metrics

    Mean absolute error              5.44142744
    Mean squared error              59.78870565
    Median absolute error            4.13838129
    Explained variance               0.29176712
    R^2                              0.29176712
    MPE                             -0.12274348
    MAPE                             0.28239036
    SMAPE                            0.12302499
    MdAPE                            0.19474014
    SMdAPE                           0.09639011
    HHI                              0.22356231
    HHI (normalized)                 0.12650760
    Quality score                    0.03355055

  Significance tests

    Bin A  Bin B  t-statistic      p-value
        0      1     2.180865 3.118080e-02
        1      2     1.537968 1.267445e-01
        2      3     5.254539 7.781110e-07
        3      4     0.064736 9.485275e-01
        4      5     1.923770 5.601023e-02
        5      6     1.867339 6.563949e-02

A smoother curve, keeping the valley monotonicity, can be achieved by using monotonic_trend="convex".

[21]:

optb = ContinuousOptimalPWBinning(name=variable, monotonic_trend="convex")
optb.fit(x, y)

[21]:

ContinuousOptimalPWBinning(monotonic_trend='convex', name='INDUS')

[22]:

binning_table = optb.binning_table
binning_table.build()

[22]:

	Bin	Count	Count (%)	Sum	Std	Min	Max	c0	c1
0	(-inf, 3.35)	63.0	0.124506	1994.0	8.569841	16.5	50.0	35.353511	-1.753581
1	[3.35, 5.04)	57.0	0.112648	1615.2	8.072710	17.2	50.0	35.353511	-1.753581
2	[5.04, 6.66)	66.0	0.130435	1723.7	7.879078	16.0	50.0	34.324955	-1.549503
3	[6.66, 9.12)	64.0	0.126482	1292.0	4.614126	12.7	35.2	34.324955	-1.549503
4	[9.12, 10.30)	29.0	0.057312	584.1	2.252281	16.1	24.5	22.168554	-0.217294
5	[10.30, 20.73)	200.0	0.395257	3736.2	8.959305	5.0	50.0	22.168554	-0.217294
6	[20.73, inf)	27.0	0.053360	456.4	3.690878	7.0	23.0	22.168554	-0.217294
7	Special	0.0	0.000000	0.0	None	None	None	0.000000	0.000000
8	Missing	0.0	0.000000	0.0	None	None	None	0.000000	0.000000
Totals		506.0	1.000000	11401.6		5.0	50.0	-	-

[23]:

binning_table.plot()

../_images/tutorials_tutorial_piecewise_continuous_44_0.png

[24]:

binning_table.analysis()

-------------------------------------------------
OptimalBinning: Continuous Binning Table Analysis
-------------------------------------------------

  General metrics

    Mean absolute error              5.51198416
    Mean squared error              60.76046517
    Median absolute error            4.19400512
    Explained variance               0.28025605
    R^2                              0.28025605
    MPE                             -0.12411637
    MAPE                             0.28485622
    SMAPE                            0.12424799
    MdAPE                            0.19529048
    SMdAPE                           0.09815566
    HHI                              0.22356231
    HHI (normalized)                 0.12650760
    Quality score                    0.03355055

  Significance tests

    Bin A  Bin B  t-statistic      p-value
        0      1     2.180865 3.118080e-02
        1      2     1.537968 1.267445e-01
        2      3     5.254539 7.781110e-07
        3      4     0.064736 9.485275e-01
        4      5     1.923770 5.601023e-02
        5      6     1.867339 6.563949e-02