Tutorial: Binning process with sklearn Pipeline

This example shows how to use a binning process as a transformation step within a scikit-learn Pipeline. A pipeline generally chains one or more transforms, followed by a final estimator.
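To make the idea of "transforms followed by a final estimator" concrete, here is a minimal, dependency-free sketch of what a pipeline does during `fit` and `predict`. The classes `MeanCenterer`, `SlopeThroughOrigin` and `ToyPipeline` are hypothetical names invented for this illustration; this is not scikit-learn's implementation, only the general pattern it follows.

```python
class MeanCenterer:
    """Toy transformer: subtracts the mean learned during fit."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SlopeThroughOrigin:
    """Toy estimator: least-squares line through the origin."""

    def fit(self, X, y):
        self.coef_ = sum(a * b for a, b in zip(X, y)) / sum(a * a for a in X)
        return self

    def predict(self, X):
        return [self.coef_ * x for x in X]


class ToyPipeline:
    """Chains transformers and a final estimator (illustrative only)."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, X, y):
        # Each transformer is fitted, then its output feeds the next step.
        for t in self.transformers:
            X = t.fit(X, y).transform(X)
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        # At prediction time the transformers are only applied, not refitted.
        for t in self.transformers:
            X = t.transform(X)
        return self.estimator.predict(X)


pipe = ToyPipeline([MeanCenterer()], SlopeThroughOrigin())
pipe.fit([1.0, 2.0, 3.0], [-2.0, 0.0, 2.0])
predictions = pipe.predict([4.0])
print(predictions)
```

The point of the pattern is that the transformers learn their parameters from the training data only, and the same fitted transformation is reused at prediction time.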

[1]:
from optbinning import BinningProcess
from tests.datasets import load_boston

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

To get started, let’s load the Boston housing dataset, a well-known dataset from the UCI repository.

[2]:
data = load_boston()

variable_names = data.feature_names
X = data.data
y = data.target
[3]:
categorical_variables = ['CHAS']

Instantiate a BinningProcess object with the variable names and the list of numerical variables to be treated as categorical. Then create a Pipeline object with two steps: a binning process transformer and a linear regression estimator.

[4]:
binning_process = BinningProcess(variable_names,
                                 categorical_variables=categorical_variables)
[5]:
lr = Pipeline(steps=[('binning_process', binning_process),
                     ('regressor', LinearRegression())])
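As a rough illustration of what a binning transformation feeds to the regressor, the sketch below replaces each value of one feature with the mean target of the bin it falls into. This mirrors the "mean" transformation that optbinning documents as the default for continuous targets, but it is only a sketch: the split points here are hypothetical and fixed by hand, whereas optbinning computes them by solving an optimization problem. The helper names `fit_bin_means` and `transform_to_bin_means` are invented for this example.

```python
from bisect import bisect_right


def fit_bin_means(x, y, splits):
    """Compute the mean target per bin, with bins defined by sorted split points."""
    sums = [0.0] * (len(splits) + 1)
    counts = [0] * (len(splits) + 1)
    for xi, yi in zip(x, y):
        b = bisect_right(splits, xi)  # bins are [split_k, split_{k+1})
        sums[b] += yi
        counts[b] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


def transform_to_bin_means(x, splits, bin_means):
    """Replace each value with the mean target of its bin."""
    return [bin_means[bisect_right(splits, xi)] for xi in x]


splits = [2.0, 5.0]  # hypothetical split points, not optimal ones
x_train = [1.0, 1.5, 3.0, 4.0, 6.0, 8.0]
y_train = [10.0, 12.0, 20.0, 22.0, 30.0, 34.0]

bin_means = fit_bin_means(x_train, y_train, splits)
encoded = transform_to_bin_means([0.5, 2.0, 7.0], splits, bin_means)
print(bin_means, encoded)
```

A linear model fitted on such encoded features can capture non-linear relationships between a feature and the target, because each bin gets its own level.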

Split the dataset into training and test sets, then fit the pipeline on the training data.

[6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[7]:
lr.fit(X_train, y_train)
[7]:
Pipeline(steps=[('binning_process',
                 BinningProcess(categorical_variables=['CHAS'],
                                variable_names=['CRIM', 'ZN', 'INDUS', 'CHAS',
                                                'NOX', 'RM', 'AGE', 'DIS',
                                                'RAD', 'TAX', 'PTRATIO', 'B',
                                                'LSTAT'])),
                ('regressor', LinearRegression())])
[8]:
y_test_predict = lr.predict(X_test)

print("MSE:      {:.3f}".format(mean_squared_error(y_test, y_test_predict)))
print("MAE:      {:.3f}".format(mean_absolute_error(y_test, y_test_predict)))
print("R2 score: {:.3f}".format(r2_score(y_test, y_test_predict)))
MSE:      13.602
MAE:      2.490
R2 score: 0.815
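For reference, the three metrics printed above can be written out by hand. The following sketch uses a tiny made-up sample (not the Boston data) and plain Python; the function names are chosen for this example and are not the scikit-learn functions themselves.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse_value = mse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print("MSE:      {:.3f}".format(mse_value))
print("MAE:      {:.3f}".format(mae_value))
print("R2 score: {:.3f}".format(r2_value))
```

Lower MSE and MAE are better, while an R2 score closer to 1 indicates that the model explains more of the variance in the target.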

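In this case, the performance metrics show that the binning step improves predictions: the pipeline reaches an MSE of 13.602, an MAE of 2.490 and an R2 score of 0.815 on the test set, whereas a plain linear regression fitted on the raw features, shown in the next cell, obtains 24.291, 3.189 and 0.669 respectively.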

[9]:
lr2 = LinearRegression()
lr2.fit(X_train, y_train)

y_test_predict = lr2.predict(X_test)

print("MSE:      {:.3f}".format(mean_squared_error(y_test, y_test_predict)))
print("MAE:      {:.3f}".format(mean_absolute_error(y_test, y_test_predict)))
print("R2 score: {:.3f}".format(r2_score(y_test, y_test_predict)))
MSE:      24.291
MAE:      3.189
R2 score: 0.669
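The gap between the two models can be quantified directly from the printed metrics. The short calculation below uses only the figures reported in the two outputs above; the variable names are chosen for this example.

```python
# Test-set metrics copied from the two outputs above.
with_binning = {"MSE": 13.602, "MAE": 2.490, "R2": 0.815}
baseline = {"MSE": 24.291, "MAE": 3.189, "R2": 0.669}

# Relative error reduction achieved by adding the binning step.
mse_reduction = (baseline["MSE"] - with_binning["MSE"]) / baseline["MSE"]
mae_reduction = (baseline["MAE"] - with_binning["MAE"]) / baseline["MAE"]

# Absolute gain in R2 score.
r2_gain = with_binning["R2"] - baseline["R2"]

print("MSE reduction: {:.1%}".format(mse_reduction))
print("MAE reduction: {:.1%}".format(mae_reduction))
print("R2 gain:       {:.3f}".format(r2_gain))
```

On this particular split, the binning step cuts the MSE by roughly 44% and the MAE by roughly 22%. These figures come from a single train/test split, so they should be read as indicative rather than as a general guarantee.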

Binning process statistics

The binning process of the pipeline can be retrieved to show information about the problem and timing statistics.
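The next cell calls the `binning_process` variable directly. This works because, with default settings, a scikit-learn Pipeline fits the step objects it was given rather than copies of them, so the same fitted object is also reachable through the pipeline's `named_steps` attribute. The sketch below demonstrates this with `StandardScaler` as a stand-in for the binning step (an assumption made to keep the example small and independent of optbinning) and a made-up four-row dataset.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler stands in for the binning step (illustrative assumption).
scaler = StandardScaler()
pipe = Pipeline(steps=[("scaler", scaler),
                       ("regressor", LinearRegression())])

# Made-up data, for illustration only.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
pipe.fit(X, y)

# With the default memory=None, steps are fitted in place, so the step
# stored in the pipeline is the very object that was passed in.
same_object = pipe.named_steps["scaler"] is scaler
fitted_mean = float(pipe.named_steps["scaler"].mean_[0])
print(same_object, fitted_mean)
```

In the tutorial's pipeline, the equivalent lookup would be `lr.named_steps['binning_process']`, using the step name given when the pipeline was created. Note that if the pipeline is built with caching enabled through its `memory` parameter, transformers are cloned before fitting, and only the `named_steps` route returns the fitted object.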

[10]:
binning_process.information(print_level=1)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0

  Statistics
    Number of records                    404
    Number of variables                   13
    Target type                   continuous

    Number of numerical                   12
    Number of categorical                  1
    Number of selected                    13

  Time                                2.1547 sec
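The 404 records reported above are the training portion of the data, since the binning process was fitted inside the pipeline on `X_train` only. The arithmetic below reproduces that number, assuming the Boston housing dataset has its usual 506 rows (a well-known figure that is not stated in this tutorial) and that the test share is rounded up, which is how `train_test_split` handles a fractional `test_size`.

```python
import math

n_samples = 506  # assumed size of the Boston housing dataset
test_size = 0.2  # value passed to train_test_split in this tutorial

# The test set size is rounded up; the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```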

The summary method returns basic statistics for each binned variable.

[11]:
binning_process.summary()
[11]:
       name        dtype   status  selected  n_bins         woe  quality_score
 0     CRIM    numerical  OPTIMAL      True      10   93.949585       0.001813
 1       ZN    numerical  OPTIMAL      True       3   62.306865       0.317480
 2    INDUS    numerical  OPTIMAL      True       7   75.118213       0.047208
 3     CHAS  categorical  OPTIMAL      True       2   52.476876       0.099815
 4      NOX    numerical  OPTIMAL      True       7   86.777245       0.073814
 5       RM    numerical  OPTIMAL      True       9  104.021845       0.029845
 6      AGE    numerical  OPTIMAL      True       9   75.403568       0.002654
 7      DIS    numerical  OPTIMAL      True       8   73.097554       0.010411
 8      RAD    numerical  OPTIMAL      True       4   62.798802       0.373463
 9      TAX    numerical  OPTIMAL      True       6   74.957266       0.188937
10  PTRATIO    numerical  OPTIMAL      True       7   73.905115       0.019986
11        B    numerical  OPTIMAL      True       9   83.013298       0.059642
12    LSTAT    numerical  OPTIMAL      True      12  124.279471       0.001798
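One simple way to read the summary is to rank the variables by one of its columns. The sketch below does this in plain Python, using the `name`, `n_bins` and `quality_score` values copied from the table above. With the object returned by `summary()`, which is a pandas DataFrame, the same ranking could be obtained by sorting on the `quality_score` column.

```python
# (name, n_bins, quality_score) copied from the summary table above.
rows = [
    ("CRIM", 10, 0.001813), ("ZN", 3, 0.317480), ("INDUS", 7, 0.047208),
    ("CHAS", 2, 0.099815), ("NOX", 7, 0.073814), ("RM", 9, 0.029845),
    ("AGE", 9, 0.002654), ("DIS", 8, 0.010411), ("RAD", 4, 0.373463),
    ("TAX", 6, 0.188937), ("PTRATIO", 7, 0.019986), ("B", 9, 0.059642),
    ("LSTAT", 12, 0.001798),
]

# Rank variables from highest to lowest quality score.
ranked = sorted(rows, key=lambda row: row[2], reverse=True)
top_three = [name for name, _, _ in ranked[:3]]

# Total number of bins across all variables.
total_bins = sum(n_bins for _, n_bins, _ in rows)

print(top_three, total_bins)
```

In this run, the variables with the fewest bins (RAD, ZN, TAX) have the highest quality scores, while the variables with the most bins (LSTAT, CRIM) have the lowest.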