Optimal binning with continuous target

class optbinning.ContinuousOptimalBinning(name='', dtype='numerical', prebinning_method='cart', max_n_prebins=20, min_prebin_size=0.05, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, monotonic_trend='auto', min_mean_diff=0, max_pvalue=None, max_pvalue_policy='consecutive', outlier_detector=None, outlier_params=None, cat_cutoff=None, user_splits=None, user_splits_fixed=None, special_codes=None, split_digits=None, time_limit=100, verbose=False, **prebinning_kwargs)

Bases: optbinning.binning.binning.OptimalBinning

Optimal binning of a numerical or categorical variable with respect to a continuous target.

Parameters
  • name (str, optional (default="")) – The variable name.

  • dtype (str, optional (default="numerical")) – The variable data type. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.

  • prebinning_method (str, optional (default="cart")) – The pre-binning method. Supported methods are “cart” for a CART decision tree, “quantile” to generate prebins with approximately same frequency and “uniform” to generate prebins with equal width. Method “cart” uses sklearn.tree.DecisionTreeRegressor.

  • max_n_prebins (int (default=20)) – The maximum number of bins after pre-binning (prebins).

  • min_prebin_size (float (default=0.05)) – The minimum number of records for each prebin, expressed as a fraction of the total number of records.

  • min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].

  • max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].

  • min_bin_size (float or None, optional (default=None)) – The minimum number of records for each bin, expressed as a fraction of the total number of records. If None, min_bin_size = min_prebin_size.

  • max_bin_size (float or None, optional (default=None)) – The maximum number of records for each bin, expressed as a fraction of the total number of records. If None, max_bin_size = 1.0.

  • monotonic_trend (str or None, optional (default="auto")) – The mean monotonic trend. Supported trends are “auto”, “auto_heuristic” and “auto_asc_desc” to automatically determine the trend minimizing the L1-norm using a machine learning classifier; “ascending”; “descending”; “concave”; “convex”; “peak” and “peak_heuristic” to allow a peak change point; and “valley” and “valley_heuristic” to allow a valley change point. Trends “auto_heuristic”, “peak_heuristic” and “valley_heuristic” use a heuristic to determine the change point and are significantly faster for large instances (max_n_prebins > 20). Trend “auto_asc_desc” automatically selects the best monotonic trend between “ascending” and “descending”. If None, then the monotonic constraint is disabled.

  • min_mean_diff (float, optional (default=0)) – The minimum mean difference between consecutive bins. This option currently only applies when monotonic_trend is “ascending” or “descending”.

  • max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins. The T-test is used to detect bins not satisfying the p-value constraint.

  • max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.

  • outlier_detector (str or None, optional (default=None)) – The outlier detection method. Supported methods are “range” to use the interquartile range based method or “zscore” to use the modified Z-score method.

  • outlier_params (dict or None, optional (default=None)) – Dictionary of parameters to pass to the outlier detection method.

  • cat_cutoff (float or None, optional (default=None)) – Generate an “others” bin containing the categories whose fraction of occurrences is below the cat_cutoff value. This option is available when dtype is “categorical”.

  • user_splits (array-like or None, optional (default=None)) – The list of pre-binning split points when dtype is “numerical” or the list of prebins when dtype is “categorical”.

  • user_splits_fixed (array-like or None (default=None)) – The list of pre-binning split points that must be fixed.

  • special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.

  • split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.

  • time_limit (int (default=100)) – The maximum time in seconds to run the optimization solver.

  • verbose (bool (default=False)) – Enable verbose output.

  • **prebinning_kwargs (keyword arguments) –

    The pre-binning keyword arguments.

    New in version 0.6.1.
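The “quantile” and “uniform” pre-binning options described above can be pictured with a small standard-library sketch. It only illustrates the two ideas (roughly equal-frequency versus equal-width split points) and is not the library's implementation; the helper names uniform_splits and quantile_splits are hypothetical.

```python
from statistics import quantiles


def uniform_splits(values, n_bins):
    """Interior split points for n_bins bins of equal width (hypothetical helper)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def quantile_splits(values, n_bins):
    """Interior split points for n_bins bins with roughly equal counts (hypothetical helper)."""
    return quantiles(values, n=n_bins, method="inclusive")


# One extreme value stretches the equal-width bins but barely moves the quantiles.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
u = uniform_splits(x, 4)
q = quantile_splits(x, 4)
print(u)
print(q)
```

With skewed data such as this, equal-width prebins can leave most records in a single prebin, which is one reason a frequency-based or tree-based method may be preferable.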

Notes

The parameters max_n_prebins and min_prebin_size control complexity and memory usage. The default values generally produce quality results; however, some improvement can be achieved by increasing max_n_prebins and/or decreasing min_prebin_size.

The T-test uses an estimate of the standard deviation of the contingency table to speed up the model generation and reduce memory usage. Therefore, the resulting bins are not guaranteed to satisfy the p-value constraint, although it may work reasonably well in most cases. To avoid bins with similar means, using the parameter min_mean_diff is recommended.
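As a rough illustration of what min_mean_diff asks for under an “ascending” trend, the following standard-library sketch computes the target mean per bin and checks the gap between consecutive bins. The helpers bin_means and satisfies_ascending are hypothetical, and the half-open bin convention [a, b) is an assumption of this sketch; the library enforces the constraint inside its optimization model rather than as a check after the fact.

```python
from bisect import bisect_right
from statistics import mean


def bin_means(x, y, splits):
    """Mean of y per bin, with bins [a, b) defined by sorted splits (hypothetical helper)."""
    groups = [[] for _ in range(len(splits) + 1)]
    for xi, yi in zip(x, y):
        groups[bisect_right(splits, xi)].append(yi)
    # statistics.mean raises StatisticsError if a bin is empty.
    return [mean(g) for g in groups]


def satisfies_ascending(means, min_mean_diff=0.0):
    """True if each bin mean exceeds the previous one by at least min_mean_diff."""
    return all(b - a >= min_mean_diff for a, b in zip(means, means[1:]))


x = [1, 2, 3, 4, 5, 6]
y = [10, 12, 20, 22, 21, 23]
means = bin_means(x, y, splits=[2.5, 4.5])
print(means)
print(satisfies_ascending(means))                   # plain ascending check
print(satisfies_ascending(means, min_mean_diff=5))  # last two bins are too similar
```

In this toy data the last two bins have means that differ by only 1, so a min_mean_diff of 5 would require them to be merged or split differently.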

property binning_table

Return an instantiated binning table. Please refer to Binning table: continuous target.

Returns

binning_table

Return type

ContinuousBinningTable

fit(x, y, check_input=False)

Fit the optimal binning according to the given training data.

Parameters
  • x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.

  • y (array-like, shape = (n_samples,)) – Target vector relative to x.

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

self – Fitted optimal binning.

Return type

ContinuousOptimalBinning

fit_transform(x, y, metric='mean', metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Fit the optimal binning according to the given training data, then transform it.

Parameters
  • x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.

  • y (array-like, shape = (n_samples,)) – Target vector relative to x.

  • metric (str (default="mean")) – The metric used to transform the input vector. Supported metrics are “mean” to choose the mean, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.

  • metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical mean, and any numerical value.

  • metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical mean, and any numerical value.

  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

x_new – Transformed array.

Return type

numpy array, shape = (n_samples,)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

information(print_level=1)

Print overview information about the options settings, problem statistics, and the solution of the computation.

Parameters

print_level (int (default=1)) – Level of details.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance
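The <component>__<parameter> naming used by set_params can be sketched as follows. This is a simplified illustration of how such keys separate into direct and nested parameters, not the estimator's actual code; the function split_nested_params and the component name "binner" are hypothetical.

```python
def split_nested_params(params):
    """Separate direct parameters from '<component>__<parameter>' ones (hypothetical helper)."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only.
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


own, nested = split_nested_params({"max_n_bins": 5, "binner__time_limit": 10})
print(own)
print(nested)
```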

property splits

List of optimal split points when dtype is set to “numerical” or list of optimal bins when dtype is set to “categorical”.

Returns

splits

Return type

numpy.ndarray

property status

The status of the underlying optimization solver.

Returns

status

Return type

str

transform(x, metric='mean', metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Transform given data to mean using bins from the fitted optimal binning.

Parameters
  • x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.

  • metric (str (default="mean")) – The metric used to transform the input vector. Supported metrics are “mean” to choose the mean, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.

  • metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical mean, and any numerical value.

  • metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical mean, and any numerical value.

  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

x_new – Transformed array.

Return type

numpy array, shape = (n_samples,)

Notes

Transformation of data including categories not present during training returns zero mean.
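For a numerical variable, the “indices” and “mean” metrics can be pictured with the standard-library sketch below. It assumes half-open bins [a, b) and that the per-bin means are already known; special codes, the “bins” metric and the “empirical” option are left out. The helpers to_bin_index and transform_mean are hypothetical and do not reproduce the library's implementation.

```python
import math
from bisect import bisect_right


def to_bin_index(value, splits):
    """Index of the bin [a, b) that contains value, given sorted splits (hypothetical helper)."""
    return bisect_right(splits, value)


def transform_mean(x, splits, bin_means, metric_missing=0):
    """Replace each value by the mean of its bin; NaN maps to metric_missing."""
    out = []
    for value in x:
        if isinstance(value, float) and math.isnan(value):
            out.append(metric_missing)
        else:
            out.append(bin_means[to_bin_index(value, splits)])
    return out


splits = [2.5, 4.5]            # two split points define three bins
bin_means = [11.0, 21.0, 22.0]  # assumed target mean of each bin
x = [1, 2.5, 5, float("nan")]

indices = [to_bin_index(v, splits) for v in x[:3]]
transformed = transform_mean(x, splits, bin_means)
print(indices)
print(transformed)
```

Note that under the assumed convention a value equal to a split point (2.5 here) falls into the bin to its right.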