Tutorial: Binning process with sklearn Pipeline
This example shows how to use a binning process as a transformation within a Scikit-learn Pipeline. A pipeline generally comprises the application of one or more transforms and a final estimator.
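As a minimal sketch of what a pipeline does when it is fitted, the snippet below chains a transform and an estimator by hand and compares the result with `Pipeline`. The synthetic data and the use of `StandardScaler` as the transform are assumptions for illustration only; they are not part of this tutorial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only, not the tutorial dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Pipeline: one transform followed by a final estimator.
pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("regressor", LinearRegression())])
pipe.fit(X_demo, y_demo)
pipe_pred = pipe.predict(X_demo)

# The same computation written out step by step.
scaler = StandardScaler().fit(X_demo)
model = LinearRegression().fit(scaler.transform(X_demo), y_demo)
manual_pred = model.predict(scaler.transform(X_demo))

# Both routes should give the same predictions.
print(np.allclose(pipe_pred, manual_pred))
```

The pipeline form keeps the transform and the estimator together, so the transform is fitted on the training data only and reapplied consistently at prediction time.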
[1]:
from optbinning import BinningProcess
from tests.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
To get started, let’s load the Boston housing dataset, a well-known dataset from the UCI repository.
[2]:
data = load_boston()
variable_names = data.feature_names
X = data.data
y = data.target
[3]:
categorical_variables = ['CHAS']
Instantiate a BinningProcess object with the variable names and the list of numerical variables to be treated as categorical. Then create a pipeline object with two steps: a binning process transformer and a linear regression estimator.
[4]:
binning_process = BinningProcess(variable_names,
                                 categorical_variables=categorical_variables)
[5]:
lr = Pipeline(steps=[('binning_process', binning_process),
                     ('regressor', LinearRegression())])
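To give some intuition for why a binning step can help a linear model, here is a rough sketch that uses NumPy only. It assumes that, for a continuous target, the transform replaces each value with the mean target of its bin, and it uses five quantile bins as a stand-in for the optimal bins that optbinning computes. The data and all variable names are hypothetical.

```python
import numpy as np

# Synthetic feature with a non-linear relationship to the target.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = np.sin(x) + rng.normal(scale=0.1, size=500)

# Stand-in for optimal binning: five quantile bins (an assumption,
# not the optimization that optbinning performs).
edges = np.quantile(x, [0.2, 0.4, 0.6, 0.8])
bin_index = np.digitize(x, edges)  # integer bin labels 0..4

# Replace each value with the mean target of its bin.
bin_means = np.array([y[bin_index == b].mean() for b in range(5)])
x_binned = bin_means[bin_index]

# Linear correlation with the target, before and after binning.
corr_raw = np.corrcoef(x, y)[0, 1]
corr_binned = np.corrcoef(x_binned, y)[0, 1]
print("raw: {:.3f}, binned: {:.3f}".format(corr_raw, corr_binned))
```

In this toy setting the binned feature is far more linearly related to the target than the raw one, which is the kind of effect a downstream linear regression can exploit.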
Split the dataset into train and test sets, then fit the pipeline with the training data.
[6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
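The split sizes can be checked without the real data. The sketch below assumes the Boston housing dataset has 506 rows and uses a placeholder array in its place; with `test_size=0.2` this should leave 404 training rows, which matches the number of records reported in the binning process statistics later in this tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder array with as many rows as the dataset is assumed to have.
n_samples = 506
indices = np.arange(n_samples)

# Same split parameters as in the tutorial.
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

print(len(train_idx), len(test_idx))
```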
[7]:
lr.fit(X_train, y_train)
[7]:
Pipeline(steps=[('binning_process',
BinningProcess(categorical_variables=['CHAS'],
variable_names=['CRIM', 'ZN', 'INDUS', 'CHAS',
'NOX', 'RM', 'AGE', 'DIS',
'RAD', 'TAX', 'PTRATIO', 'B',
'LSTAT'])),
('regressor', LinearRegression())])
[8]:
y_test_predict = lr.predict(X_test)
print("MSE: {:.3f}".format(mean_squared_error(y_test, y_test_predict)))
print("MAE: {:.3f}".format(mean_absolute_error(y_test, y_test_predict)))
print("R2 score: {:.3f}".format(r2_score(y_test, y_test_predict)))
MSE: 13.602
MAE: 2.490
R2 score: 0.815
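In this case, the binning process transformation improves predictions: compared with a plain linear regression fitted on the raw features (next cell), the test MSE drops from 24.291 to 13.602 and the R2 score rises from 0.669 to 0.815.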
[9]:
lr2 = LinearRegression()
lr2.fit(X_train, y_train)
y_test_predict = lr2.predict(X_test)
print("MSE: {:.3f}".format(mean_squared_error(y_test, y_test_predict)))
print("MAE: {:.3f}".format(mean_absolute_error(y_test, y_test_predict)))
print("R2 score: {:.3f}".format(r2_score(y_test, y_test_predict)))
MSE: 24.291
MAE: 3.189
R2 score: 0.669
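To put the two sets of metrics side by side, the relative change can be computed directly from the printed values. The numbers below are copied from the two outputs above; the helper `relative_reduction` is a hypothetical name introduced only for this sketch.

```python
# Metrics copied from the outputs of the two models above.
with_binning = {"MSE": 13.602, "MAE": 2.490, "R2": 0.815}
baseline = {"MSE": 24.291, "MAE": 3.189, "R2": 0.669}


def relative_reduction(baseline_value, new_value):
    """Fraction by which new_value is lower than baseline_value."""
    return (baseline_value - new_value) / baseline_value


mse_reduction = relative_reduction(baseline["MSE"], with_binning["MSE"])
mae_reduction = relative_reduction(baseline["MAE"], with_binning["MAE"])
r2_gain = with_binning["R2"] - baseline["R2"]

print("MSE reduction: {:.1%}".format(mse_reduction))
print("MAE reduction: {:.1%}".format(mae_reduction))
print("R2 gain: {:.3f}".format(r2_gain))
```

On this single train/test split, the pipeline with binning lowers the MSE by roughly 44% and the MAE by roughly 22% relative to the plain linear regression.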
Binning process statistics
The binning process of the pipeline can be retrieved to show information about the problem and timing statistics.
[10]:
binning_process.information(print_level=1)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Statistics
Number of records 404
Number of variables 13
Target type continuous
Number of numerical 12
Number of categorical 1
Number of selected 13
Time 2.1547 sec
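A fitted step can also be looked up by name on the pipeline itself through `named_steps`. The sketch below assumes synthetic data and uses scikit-learn's `KBinsDiscretizer` as a stand-in for `BinningProcess`; in this tutorial the equivalent lookup would be `lr.named_steps['binning_process']`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = X_demo[:, 0] + 2.0 * X_demo[:, 1]

# Stand-in binning transformer; BinningProcess plays this role in the tutorial.
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
pipe = Pipeline(steps=[("binner", binner),
                       ("regressor", LinearRegression())])
pipe.fit(X_demo, y_demo)

# Retrieve the fitted transformer by its step name.
fitted_binner = pipe.named_steps["binner"]

# Fitted attributes are available on the retrieved step.
for edges in fitted_binner.bin_edges_:
    print(np.round(edges, 3))
```

With the default `memory=None`, the pipeline should fit the step object it was given rather than a copy, which is why calling methods on the original `binning_process` variable works after `lr.fit`.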
The summary method returns basic statistics for each binned variable.
[11]:
binning_process.summary()
[11]:
| | name | dtype | status | selected | n_bins | woe | quality_score |
---|---|---|---|---|---|---|---|
0 | CRIM | numerical | OPTIMAL | True | 10 | 93.949585 | 0.001813 |
1 | ZN | numerical | OPTIMAL | True | 3 | 62.306865 | 0.317480 |
2 | INDUS | numerical | OPTIMAL | True | 7 | 75.118213 | 0.047208 |
3 | CHAS | categorical | OPTIMAL | True | 2 | 52.476876 | 0.099815 |
4 | NOX | numerical | OPTIMAL | True | 7 | 86.777245 | 0.073814 |
5 | RM | numerical | OPTIMAL | True | 9 | 104.021845 | 0.029845 |
6 | AGE | numerical | OPTIMAL | True | 9 | 75.403568 | 0.002654 |
7 | DIS | numerical | OPTIMAL | True | 8 | 73.097554 | 0.010411 |
8 | RAD | numerical | OPTIMAL | True | 4 | 62.798802 | 0.373463 |
9 | TAX | numerical | OPTIMAL | True | 6 | 74.957266 | 0.188937 |
10 | PTRATIO | numerical | OPTIMAL | True | 7 | 73.905115 | 0.019986 |
11 | B | numerical | OPTIMAL | True | 9 | 83.013298 | 0.059642 |
12 | LSTAT | numerical | OPTIMAL | True | 12 | 124.279471 | 0.001798 |