Optimal piecewise binning with binary target

class optbinning.OptimalPWBinning(name='', estimator=None, objective='l2', degree=1, continuous=True, continuous_deriv=True, prebinning_method='cart', max_n_prebins=20, min_prebin_size=0.05, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, monotonic_trend='auto', n_subsamples=None, max_pvalue=None, max_pvalue_policy='consecutive', outlier_detector=None, outlier_params=None, user_splits=None, user_splits_fixed=None, special_codes=None, split_digits=None, solver='auto', h_epsilon=1.35, quantile=0.5, regularization=None, reg_l1=1.0, reg_l2=1.0, random_state=None, verbose=False)
Bases: optbinning.binning.piecewise.base.BasePWBinning
Optimal Piecewise binning of a numerical variable with respect to a binary target.
 Parameters
name (str, optional (default="")) – The variable name.
estimator (object or None (default=None)) – An estimator to compute probability estimates. If None, it uses sklearn.linear_model.LogisticRegression. The estimator must be an object with a predict_proba method.
objective (str, optional (default="l2")) – The objective function. Supported objectives are “l2”, “l1”, “huber” and “quantile”. Note that “l1”, “huber” and “quantile” are robust objective functions.
degree (int (default=1)) –
The degree of the polynomials.
degree = 0: piecewise constant functions.
degree = 1: piecewise linear functions.
degree > 1: piecewise polynomial functions.
continuous (bool (default=True)) – Whether to fit a continuous or discontinuous piecewise regression.
continuous_deriv (bool (default=True)) – Whether to fit a polynomial with continuous derivatives. This option fits a smooth degree d polynomial with d-1 continuity in derivatives (splines).
prebinning_method (str, optional (default="cart")) – The prebinning method. Supported methods are “cart” for a CART decision tree, “quantile” to generate prebins with approximately the same frequency and “uniform” to generate prebins with equal width. Method “cart” uses sklearn.tree.DecisionTreeClassifier.
max_n_prebins (int (default=20)) – The maximum number of bins after prebinning (prebins).
min_prebin_size (float (default=0.05)) – The minimum number of records for each prebin, as a fraction of the total.
min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].
max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].
min_bin_size (float or None, optional (default=None)) – The fraction of minimum number of records for each bin. If None, min_bin_size = min_prebin_size.
max_bin_size (float or None, optional (default=None)) – The fraction of maximum number of records for each bin. If None, max_bin_size = 1.0.
monotonic_trend (str or None, optional (default="auto")) – The event rate monotonic trend. Supported trends are “auto”, “auto_heuristic” and “auto_asc_desc” to automatically determine the trend maximizing IV using a machine learning classifier, “ascending”, “descending”, “concave”, “convex”, “peak” and “peak_heuristic” to allow a peak change point, and “valley” and “valley_heuristic” to allow a valley change point. Trends “auto_heuristic”, “peak_heuristic” and “valley_heuristic” use a heuristic to determine the change point, and are significantly faster for large size instances (max_n_prebins > 20). Trend “auto_asc_desc” is used to automatically select the best monotonic trend between “ascending” and “descending”. If None, then the monotonic constraint is disabled.
n_subsamples (int or None (default=None)) – Number of subsamples to fit the piecewise regression algorithm. If None, all values are considered.
max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins. The Z-test is used to detect bins not satisfying the p-value constraint. Option supported by solvers “cp” and “mip”.
max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.
outlier_detector (str or None, optional (default=None)) – The outlier detection method. Supported methods are “range” to use the interquartile range based method, “zscore” to use the modified Z-score method or “yquantile” to use the y-axis detector over quantiles.
outlier_params (dict or None, optional (default=None)) – Dictionary of parameters to pass to the outlier detection method.
user_splits (array-like or None, optional (default=None)) – The list of prebinning split points when dtype is “numerical” or the list of prebins when dtype is “categorical”.
user_splits_fixed (array-like or None (default=None)) – The list of prebinning split points that must be fixed.
special_codes (array-like, dict or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.
solver (str, optional (default="auto")) – The optimizer to solve the underlying mathematical optimization problem. Supported solvers are “ecos”, “osqp”, “direct”, to choose the direct solver, and “auto”, to choose the most appropriate solver for the problem. Version 0.16.1 added support for solvers “scs” and “highs”.
h_epsilon (float (default=1.35)) – The parameter h_epsilon, used when objective="huber", controls the number of samples that should be classified as outliers.
quantile (float (default=0.5)) – The parameter quantile is the q-th quantile to be used when objective="quantile".
regularization (str or None (default=None)) – Type of regularization. Supported regularization types are “l1” (Lasso) and “l2” (Ridge). If None, no regularization is applied.
reg_l1 (float (default=1.0)) – L1 regularization term. Increasing this value will smooth the regression model. Only applicable if regularization="l1".
reg_l2 (float (default=1.0)) – L2 regularization term. Increasing this value will smooth the regression model. Only applicable if regularization="l2".
random_state (int, RandomState instance or None, (default=None)) – If n_subsamples < n_samples, controls the shuffling applied to the data before applying the split.
verbose (bool (default=False)) – Enable verbose output.
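
The objective, h_epsilon and quantile parameters can be illustrated with a minimal sketch of the four per-residual losses in their common textbook form. This is an assumption for illustration only: the function names are hypothetical, and the scaling used internally by the library's solver may differ.

```python
# Textbook per-residual losses, where r = y_true - y_predicted.
# These helper names are hypothetical and not part of optbinning.

def l2_loss(r):
    """Squared error: large residuals dominate the fit."""
    return r * r


def l1_loss(r):
    """Absolute error: grows linearly, so outliers weigh less."""
    return abs(r)


def huber_loss(r, h_epsilon=1.35):
    """Quadratic for |r| <= h_epsilon, linear beyond that threshold."""
    a = abs(r)
    if a <= h_epsilon:
        return 0.5 * r * r
    return h_epsilon * (a - 0.5 * h_epsilon)


def quantile_loss(r, quantile=0.5):
    """Pinball loss: under- and over-predictions are weighted asymmetrically."""
    if r >= 0:
        return quantile * r
    return (quantile - 1.0) * r


if __name__ == "__main__":
    residuals = [0.1, -0.5, 4.0]  # the last residual plays the role of an outlier
    for name, loss in [("l2", l2_loss), ("l1", l1_loss),
                       ("huber", huber_loss), ("quantile", quantile_loss)]:
        print(name, [round(loss(r), 4) for r in residuals])
```

A smaller h_epsilon moves more residuals into the linear regime, which matches the description of h_epsilon as controlling how many samples are treated as outliers.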

property binning_table
Return an instantiated binning table. Please refer to Binning table: binary target.
 Returns
binning_table
 Return type
BinningTable

fit(x, y, lb=None, ub=None, check_input=False)
Fit the optimal piecewise binning according to the given training data.
 Parameters
x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Target vector relative to x.
check_input (bool (default=False)) – Whether to check input arrays.
 Returns
self – Fitted optimal piecewise binning.
 Return type
BasePWBinning

fit_transform(x, y, metric='woe', metric_special=0, metric_missing=0, lb=None, ub=None, check_input=False)
Fit the optimal piecewise binning according to the given training data, then transform it.
 Parameters
x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Target vector relative to x.
metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence and “event_rate” to choose the event rate.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate, and any numerical value.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
lb (float or None (default=None)) – Avoid values below the lower bound lb.
ub (float or None (default=None)) – Avoid values above the upper bound ub.
check_input (bool (default=False)) – Whether to check input arrays.
 Returns
x_new – Transformed array.
 Return type
numpy array, shape = (n_samples,)

get_params(deep=True)
Get parameters for this estimator.
 Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
params – Parameter names mapped to their values.
 Return type
dict

information(print_level=1)
Print overview information about the option settings, problem statistics, and the solution of the computation.
 Parameters
print_level (int (default=1)) – Level of details.

set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
**params (dict) – Estimator parameters.
 Returns
self – Estimator instance.
 Return type
estimator instance
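
The <component>__<parameter> naming convention can be illustrated with a minimal sketch that separates top-level parameters from nested ones. The helper split_nested_params is hypothetical and is not how scikit-learn or optbinning implement set_params; the key estimator__C assumes, for illustration only, a nested estimator that exposes a parameter named C.

```python
def split_nested_params(params):
    """Separate plain parameters from '<component>__<parameter>' ones.

    Returns a pair (own, nested), where nested maps each component name
    to the dictionary of parameters addressed to that component.
    """
    own, nested = {}, {}
    for key, value in params.items():
        # Split only on the first double underscore, so deeper nesting
        # such as 'a__b__c' is kept as component 'a' with key 'b__c'.
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


if __name__ == "__main__":
    own, nested = split_nested_params({"max_n_bins": 5, "estimator__C": 0.1})
    print(own)     # parameters of the object itself
    print(nested)  # parameters grouped by nested component
```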

property splits
List of optimal split points when dtype is set to “numerical” or list of optimal bins when dtype is set to “categorical”.
 Returns
splits
 Return type
numpy.ndarray
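
To illustrate how a sorted list of split points partitions a numerical variable, the following minimal sketch maps a value to a bin index with the standard-library bisect module. It assumes left-closed, right-open intervals; the split values are made up and bin_index is a hypothetical helper, not part of optbinning.

```python
from bisect import bisect_right


def bin_index(value, splits):
    """Return the index of the bin containing value.

    With n sorted split points there are n + 1 bins:
    (-inf, s0), [s0, s1), ..., [s_{n-1}, +inf).
    """
    return bisect_right(splits, value)


if __name__ == "__main__":
    splits = [11.4, 13.1, 15.0]  # made-up split points for illustration
    for v in (10.0, 11.4, 14.0, 20.0):
        print(v, "-> bin", bin_index(v, splits))
```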

property status
The status of the underlying optimization solver.
 Returns
status
 Return type
str

transform(x, metric='woe', metric_special=0, metric_missing=0, lb=None, ub=None, check_input=False)
Transform given data to Weight of Evidence (WoE) or event rate using bins from the fitted optimal piecewise binning.
 Parameters
x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence and “event_rate” to choose the event rate.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate, and any numerical value.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
lb (float or None (default=None)) – Avoid values below the lower bound lb.
ub (float or None (default=None)) – Avoid values above the upper bound ub.
check_input (bool (default=False)) – Whether to check input arrays.
 Returns
x_new – Transformed array.
 Return type
numpy array, shape = (n_samples,)
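
The relationship between the two supported metrics can be sketched as follows. This minimal example assumes the common convention in which WoE is the logarithm of the share of non-events divided by the share of events; event_rate_to_woe is a hypothetical helper written for illustration, and the counts are made up.

```python
import math


def event_rate_to_woe(event_rate, n_nonevent, n_event):
    """Convert an event rate into a Weight of Evidence value.

    Assumed convention: WoE = ln(share of non-events / share of events),
    which for an event rate p reduces to
    ln(((1 - p) / p) * (n_event / n_nonevent)).
    n_nonevent and n_event are the totals over the whole sample.
    The event rate must lie strictly between 0 and 1.
    """
    return math.log((1.0 / event_rate - 1.0) * n_event / n_nonevent)


if __name__ == "__main__":
    n_nonevent, n_event = 80, 20  # made-up totals: overall event rate is 0.2
    for p in (0.1, 0.2, 0.5):
        print(p, event_rate_to_woe(p, n_nonevent, n_event))
```

Under this convention, an event rate equal to the overall rate gives a WoE of zero, lower event rates give positive values, and higher event rates give negative values.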