Tutorial: Binning process with sklearn Pipeline

This example shows how to use a binning process as a transformation within a Scikit-learn Pipeline. A pipeline generally comprises the application of one or more transforms and a final estimator.

[1]:

from optbinning import BinningProcess
# load_boston is imported from a local tests.datasets module, not from scikit-learn
from tests.datasets import load_boston

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

To get started, let’s load the Boston housing dataset, a well-known dataset from the UCI repository.

[2]:

data = load_boston()

variable_names = data.feature_names
X = data.data
y = data.target

[3]:

categorical_variables = ['CHAS']

Instantiate a BinningProcess object with the variable names and the list of numerical variables to be treated as categorical. Then create a pipeline object with two steps: a binning process transformer and a linear regression estimator.

[4]:

binning_process = BinningProcess(variable_names,
                                 categorical_variables=categorical_variables)
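To give a feel for what a binning transformation does to a feature, here is a rough pure-Python sketch. It assumes that each value is replaced by the mean target of its bin, which is my understanding of the usual choice for a continuous target but is not stated in this tutorial. The toy data, the hand-picked split points and the helper names `bin_index` and `transform` are all hypothetical; optbinning searches for the split points itself rather than taking them as given.

```python
from bisect import bisect_right

# Hypothetical toy data: one numerical feature and a continuous target.
x_train = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 12.0, 13.0]
y_train = [10.0, 12.0, 11.0, 20.0, 22.0, 21.0, 30.0, 32.0]

# Hand-picked split points (an assumption for illustration only).
# They define the bins (-inf, 5), [5, 10) and [10, inf).
splits = [5.0, 10.0]


def bin_index(value):
    """Return the index of the bin that contains value."""
    return bisect_right(splits, value)


# "Fit": compute the mean target of the training rows in each bin.
totals = [0.0] * (len(splits) + 1)
counts = [0] * (len(splits) + 1)
for value, target in zip(x_train, y_train):
    totals[bin_index(value)] += target
    counts[bin_index(value)] += 1
bin_means = [total / count for total, count in zip(totals, counts)]


def transform(values):
    """Replace each value with the mean target of its bin."""
    return [bin_means[bin_index(v)] for v in values]


print(bin_means)
print(transform([2.5, 9.0, 100.0]))
```

After such a transformation the feature takes only a handful of distinct values, one per bin, which is what lets a linear model pick up a non-linear relationship between the original feature and the target.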

[5]:

lr = Pipeline(steps=[('binning_process', binning_process),
                     ('regressor', LinearRegression())])
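Conceptually, fitting a pipeline fits and applies each transform in order and then fits the final estimator on the transformed data; predicting reuses the fitted transforms. The sketch below illustrates that flow with toy stand-ins. `ToyCenterer`, `ToyLineRegressor` and `ToyPipeline` are hypothetical classes written for this illustration and are not scikit-learn's implementation.

```python
# Conceptual sketch only: toy stand-ins, not scikit-learn's implementation.


class ToyCenterer:
    """Hypothetical transform: subtracts the training mean from each value."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class ToyLineRegressor:
    """Hypothetical estimator: least-squares line for a single feature."""

    def fit(self, X, y):
        mean_x = sum(X) / len(X)
        mean_y = sum(y) / len(y)
        covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(X, y))
        variance = sum((a - mean_x) ** 2 for a in X)
        self.slope_ = covariance / variance
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * x for x in X]


class ToyPipeline:
    """Chains transforms and a final estimator, given as (name, object) pairs."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # Fit each transform on the output of the previous one.
        for _, transform in self.steps[:-1]:
            X = transform.fit(X, y).transform(X)
        # The last step is the estimator; it sees the transformed data.
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Reuse the fitted transforms; nothing is refitted at prediction time.
        for _, transform in self.steps[:-1]:
            X = transform.transform(X)
        return self.steps[-1][1].predict(X)


toy = ToyPipeline([("centerer", ToyCenterer()),
                   ("regressor", ToyLineRegressor())])
toy.fit([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
predictions = toy.predict([4.0, 5.0])
print(predictions)
```

The point of the design is that the test data never influences the transforms: the centering mean (or, in this tutorial, the bins) is learned from the training data only and then applied unchanged at prediction time.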

Split the dataset into training and test sets, then fit the pipeline with the training data.

[6]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[7]:

lr.fit(X_train, y_train)

[7]:

Pipeline(steps=[('binning_process',
                 BinningProcess(categorical_variables=['CHAS'],
                                variable_names=['CRIM', 'ZN', 'INDUS', 'CHAS',
                                                'NOX', 'RM', 'AGE', 'DIS',
                                                'RAD', 'TAX', 'PTRATIO', 'B',
                                                'LSTAT'])),
                ('regressor', LinearRegression())])

[8]:

y_test_predict = lr.predict(X_test)

print("MSE:      {:.3f}".format(mean_squared_error(y_test, y_test_predict)))
print("MAE:      {:.3f}".format(mean_absolute_error(y_test, y_test_predict)))
print("R2 score: {:.3f}".format(r2_score(y_test, y_test_predict)))

MSE:      13.602
MAE:      2.490
R2 score: 0.815
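For reference, the three metrics printed above follow the standard definitions of mean squared error, mean absolute error and the coefficient of determination. The sketch below computes them by hand on hypothetical toy values; the helper names `mse`, `mae` and `r2` are introduced here for illustration and are not part of scikit-learn.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy values, unrelated to the dataset used in this tutorial.
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.0, 5.0, 8.0, 11.0]

print("MSE:      {:.3f}".format(mse(toy_true, toy_pred)))
print("MAE:      {:.3f}".format(mae(toy_true, toy_pred)))
print("R2 score: {:.3f}".format(r2(toy_true, toy_pred)))
```

Lower MSE and MAE are better, while an R2 score closer to 1 means the predictions explain more of the variance in the target.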

On this test set, the binning process transformation improves predictions: the same linear regression fitted on the raw features, as in the following cell, scores worse on all three metrics.

[9]:

lr2 = LinearRegression()
lr2.fit(X_train, y_train)

y_test_predict = lr2.predict(X_test)

print("MSE:      {:.3f}".format(mean_squared_error(y_test, y_test_predict)))
print("MAE:      {:.3f}".format(mean_absolute_error(y_test, y_test_predict)))
print("R2 score: {:.3f}".format(r2_score(y_test, y_test_predict)))

MSE:      24.291
MAE:      3.189
R2 score: 0.669
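To put the two sets of results side by side, the short sketch below computes the relative reduction in each error metric, using only the figures printed in the two cells above. The dictionary and variable names are chosen here for illustration.

```python
# Test-set metrics copied from the outputs above.
with_binning = {"MSE": 13.602, "MAE": 2.490}
without_binning = {"MSE": 24.291, "MAE": 3.189}

# Relative reduction in error obtained by adding the binning step.
reductions = {}
for name in with_binning:
    baseline = without_binning[name]
    reductions[name] = (baseline - with_binning[name]) / baseline
    print("{}: {:.1%} lower with binning".format(name, reductions[name]))
```

These figures come from a single train/test split with `random_state=42`, so they describe this split rather than a general guarantee.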

Binning process statistics

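The fitted binning process can be queried for information about the problem and timing statistics. Here the `binning_process` object passed to the pipeline is used directly; the same step can also be retrieved from the pipeline with `lr.named_steps['binning_process']`.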

[10]:

binning_process.information(print_level=1)

optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0

  Statistics
    Number of records                    404
    Number of variables                   13
    Target type                   continuous

    Number of numerical                   12
    Number of categorical                  1
    Number of selected                    13

  Time                                2.1547 sec
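The 404 records reported above are the training rows only, since the pipeline was fitted on `X_train`. The sketch below reproduces that count under two assumptions that are not stated in this tutorial: that the full dataset has 506 rows (the usual size of the Boston housing data), and that the test count for a fractional `test_size` is rounded up, which is my understanding of how `train_test_split` behaves.

```python
import math

n_samples = 506   # assumption: usual size of the Boston housing dataset
test_size = 0.2   # value passed to train_test_split earlier

# Assumed rounding rule: the test count is rounded up, and the training
# set receives the remaining rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("train rows:", n_train)
print("test rows: ", n_test)
```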

The summary method returns basic statistics for each binned variable.

[11]:

binning_process.summary()

[11]:

	name	dtype	status	selected	n_bins	woe	quality_score
0	CRIM	numerical	OPTIMAL	True	10	93.949585	0.001813
1	ZN	numerical	OPTIMAL	True	3	62.306865	0.317480
2	INDUS	numerical	OPTIMAL	True	7	75.118213	0.047208
3	CHAS	categorical	OPTIMAL	True	2	52.476876	0.099815
4	NOX	numerical	OPTIMAL	True	7	86.777245	0.073814
5	RM	numerical	OPTIMAL	True	9	104.021845	0.029845
6	AGE	numerical	OPTIMAL	True	9	75.403568	0.002654
7	DIS	numerical	OPTIMAL	True	8	73.097554	0.010411
8	RAD	numerical	OPTIMAL	True	4	62.798802	0.373463
9	TAX	numerical	OPTIMAL	True	6	74.957266	0.188937
10	PTRATIO	numerical	OPTIMAL	True	7	73.905115	0.019986
11	B	numerical	OPTIMAL	True	9	83.013298	0.059642
12	LSTAT	numerical	OPTIMAL	True	12	124.279471	0.001798
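One simple way to read this summary is to rank the variables by a column of interest. The sketch below does that in plain Python on the `name`, `n_bins` and `quality_score` values copied from the output above; the list and variable names are introduced here for illustration. If `summary()` returns a pandas DataFrame, as the output suggests, the same ranking could be done with pandas sorting instead.

```python
# (name, n_bins, quality_score) copied from the summary output above.
summary_rows = [
    ("CRIM", 10, 0.001813),
    ("ZN", 3, 0.317480),
    ("INDUS", 7, 0.047208),
    ("CHAS", 2, 0.099815),
    ("NOX", 7, 0.073814),
    ("RM", 9, 0.029845),
    ("AGE", 9, 0.002654),
    ("DIS", 8, 0.010411),
    ("RAD", 4, 0.373463),
    ("TAX", 6, 0.188937),
    ("PTRATIO", 7, 0.019986),
    ("B", 9, 0.059642),
    ("LSTAT", 12, 0.001798),
]

# Rank variables by quality score, highest first.
by_quality = sorted(summary_rows, key=lambda row: row[2], reverse=True)
top_three = [name for name, _, _ in by_quality[:3]]

# Variable with the most bins, and the total number of bins.
most_binned = max(summary_rows, key=lambda row: row[1])[0]
total_bins = sum(n_bins for _, n_bins, _ in summary_rows)

print("Highest quality scores:", top_three)
print("Most bins:", most_binned)
print("Total bins:", total_bins)
```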