Binning process

class optbinning.BinningProcess(variable_names, max_n_prebins=20, min_prebin_size=0.05, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, max_pvalue=None, max_pvalue_policy='consecutive', selection_criteria=None, fixed_variables=None, categorical_variables=None, special_codes=None, split_digits=None, binning_fit_params=None, binning_transform_params=None, n_jobs=None, verbose=False)

Bases: optbinning.binning.base.Base, sklearn.base.BaseEstimator, optbinning.binning.binning_process.BaseBinningProcess

Binning process to compute the optimal binning of the variables in a dataset, given a binary, continuous or multiclass target type.

Parameters

variable_names (array-like) – List of variable names.
max_n_prebins (int (default=20)) – The maximum number of bins after pre-binning (prebins).
min_prebin_size (float (default=0.05)) – The minimum fraction of records for each prebin.
min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].
max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].
min_bin_size (float or None, optional (default=None)) – The minimum fraction of records for each bin. If None, min_bin_size = min_prebin_size.
max_bin_size (float or None, optional (default=None)) – The maximum fraction of records for each bin. If None, max_bin_size = 1.0.
max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins.
max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.
selection_criteria (dict or None (default=None)) –
Variable selection criteria. See notes.

New in version 0.6.0.
fixed_variables (array-like or None) –
List of variables to be fixed. The binning process retains these variables even if the selection criteria are not satisfied.

New in version 0.12.1.
special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.
categorical_variables (array-like or None, optional (default=None)) – List of numerical variables to be considered categorical. These are nominal variables. Not applicable when the target type is multiclass.
binning_fit_params (dict or None, optional (default=None)) – Dictionary with optimal binning fitting options for specific variables. Example: {"variable_1": {"max_n_bins": 4}}.
binning_transform_params (dict or None, optional (default=None)) – Dictionary with optimal binning transform options for specific variables. Example: {"variable_1": {"metric": "event_rate"}}.
n_jobs (int or None, optional (default=None)) –
Number of cores to run in parallel while binning variables. None means 1 core. -1 means using all processors.

New in version 0.7.1.
verbose (bool (default=False)) – Enable verbose output.
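As a rough illustration of how per-variable options relate to the global ones, the sketch below merges a set of global settings with a binning_fit_params dictionary. The helper resolve_fit_params, the variable names and the rule that per-variable entries take priority are assumptions made for this example; this is not the library's internal logic.

```python
# Global options shared by every variable (values mirror the defaults above).
global_params = {"max_n_prebins": 20, "min_prebin_size": 0.05, "max_n_bins": None}

# Per-variable overrides, in the format expected by binning_fit_params.
binning_fit_params = {
    "variable_1": {"max_n_bins": 4},
    "variable_2": {"min_prebin_size": 0.1},
}


def resolve_fit_params(name, global_params, binning_fit_params):
    """Hypothetical helper: options for one variable, overrides taking priority."""
    return {**global_params, **binning_fit_params.get(name, {})}


resolved = {
    name: resolve_fit_params(name, global_params, binning_fit_params)
    for name in ("variable_1", "variable_2", "variable_3")
}
```

Here variable_3 has no entry in binning_fit_params, so it simply keeps the global settings.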

Notes

The selection_criteria parameter specifies the criteria for variable selection. The input is a dictionary of the following form:

selection_criteria = {
    "metric_1":
        {
            "min": 0, "max": 1, "strategy": "highest", "top": 0.25
        },
    "metric_2":
        {
            "min": 0.02
        }
}

where several metrics can be combined. For example, the dictionary above indicates that the top 25% of variables with “metric_1” in [0, 1] and “metric_2” greater than or equal to 0.02 are selected. Supported key values are:

keys min and max support numerical values.
key strategy supports options “highest” and “lowest”.
key top supports an integer or decimal (percentage).
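The semantics described above can be sketched in plain Python. The function select_variables and the metric values are hypothetical, and two details are assumptions rather than documented behaviour: the bounds are applied before the ranking, and a decimal top is read as a fraction of the variables that pass the bounds.

```python
import math


def select_variables(metrics, selection_criteria):
    """Illustrative selection: apply min/max bounds, then the optional top rule.

    metrics maps variable name -> {metric name: value}.
    """
    selected = list(metrics)

    # Step 1: keep variables whose metric values lie within [min, max].
    for metric, rule in selection_criteria.items():
        low = rule.get("min", -math.inf)
        high = rule.get("max", math.inf)
        selected = [v for v in selected if low <= metrics[v][metric] <= high]

    # Step 2: for rules with a strategy, keep only the top-ranked survivors.
    for metric, rule in selection_criteria.items():
        if "strategy" not in rule or "top" not in rule:
            continue
        top = rule["top"]
        if isinstance(top, float):
            # Assumption: a decimal is a fraction of the surviving variables.
            top = math.ceil(top * len(selected))
        reverse = rule["strategy"] == "highest"
        ranked = sorted(selected, key=lambda v: metrics[v][metric], reverse=reverse)
        keep = set(ranked[:top])
        selected = [v for v in selected if v in keep]

    return selected


selection_criteria = {
    "metric_1": {"min": 0, "max": 1, "strategy": "highest", "top": 0.25},
    "metric_2": {"min": 0.02},
}

# Made-up metric values for six variables.
metrics = {
    "a": {"metric_1": 0.9, "metric_2": 0.10},
    "b": {"metric_1": 0.5, "metric_2": 0.01},  # fails the metric_2 minimum
    "c": {"metric_1": 0.4, "metric_2": 0.05},
    "d": {"metric_1": 1.5, "metric_2": 0.30},  # fails the metric_1 maximum
    "e": {"metric_1": 0.2, "metric_2": 0.03},
    "f": {"metric_1": 0.7, "metric_2": 0.04},
}

selected = select_variables(metrics, selection_criteria)
```

Four variables pass both bounds, so a top of 0.25 keeps the single variable with the highest metric_1 among them.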

Warning

If the binning process instance is going to be saved, do not pass the option "solver": "mip" via the binning_fit_params parameter.

fit(X, y, sample_weight=None, check_input=False)

Fit the binning process. Fit the optimal binning to all variables according to the given training data.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) –
Training vector, where n_samples is the number of samples.

Changed in version 0.4.0.

X supports numpy.ndarray and pandas.DataFrame.
y (array-like of shape (n_samples,)) – Target vector relative to X.
sample_weight (array-like of shape (n_samples,) (default=None)) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Only applied if prebinning_method="cart". This option is only available for a binary target.
check_input (bool (default=False)) – Whether to check input arrays.

Returns

self – Fitted binning process.

Return type

BinningProcess

fit_disk(input_path, target, **kwargs)

Fit the binning process according to the given training data on disk.

Parameters

input_path (str) – Any valid string path to a file with extension .csv or .parquet.
target (str) – Target column.
**kwargs (keyword arguments) – Keyword arguments for pandas.read_csv or pandas.read_parquet.

Returns

self – Fitted binning process.

Return type

BinningProcess

fit_from_dict(dict_optb)

Fit the binning process from a dict of OptimalBinning objects already fitted.

Parameters: dict_optb (dict) – Dictionary with OptimalBinning objects for binary, continuous or multiclass target. All objects must share the same class.
Returns: self – Fitted binning process.
Return type: BinningProcess

fit_transform(X, y, sample_weight=None, metric=None, metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Fit the binning process according to the given training data, then transform it.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples.
y (array-like of shape (n_samples,)) – Target vector relative to X.
sample_weight (array-like of shape (n_samples,) (default=None)) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Only applied if prebinning_method="cart". This option is only available for a binary target.
metric (str or None, (default=None)) – The metric used to transform the input vector. If None, the default transformation metric for each target type is applied. For binary target options are: “woe” (default), “event_rate”, “indices” and “bins”. For continuous target options are: “mean” (default), “indices” and “bins”. For multiclass target options are: “mean_woe” (default), “weighted_mean_woe”, “indices” and “bins”.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.

Returns

X_new – Transformed array.

Return type

numpy array, shape = (n_samples, n_features_new)
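For a binary target, the “woe” and “event_rate” metrics are per-bin statistics. The sketch below computes both from bin counts. The counts are made up, the function bin_metrics is hypothetical, and the sign convention (share of non-events over share of events) is an assumption; some references define WoE as the reciprocal ratio.

```python
import math


def bin_metrics(n_nonevent, n_event):
    """Per-bin event rate and WoE from counts of non-events and events."""
    total_nonevent = sum(n_nonevent)
    total_event = sum(n_event)
    rows = []
    for ne, e in zip(n_nonevent, n_event):
        event_rate = e / (ne + e)
        # Assumed convention: WoE = ln(share of non-events / share of events).
        woe = math.log((ne / total_nonevent) / (e / total_event))
        rows.append({"event_rate": event_rate, "woe": woe})
    return rows


# Made-up counts for three bins of one variable.
rows = bin_metrics(n_nonevent=[400, 300, 100], n_event=[20, 30, 50])
```

Under this convention, bins with a lower event rate than the overall population receive a positive WoE and bins with a higher event rate receive a negative one.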

fit_transform_disk(input_path, output_path, target, chunksize, metric=None, metric_special=0, metric_missing=0, show_digits=2, **kwargs)

Fit the binning process according to the given training data on disk, then transform it and save to comma-separated values (csv) file.

Parameters

input_path (str) – Any valid string path to a file with extension .csv.
output_path (str) – Any valid string path to a file with extension .csv.
target (str) – Target column.
chunksize (int) – Rows to read, transform and write at a time.
metric (str or None, (default=None)) – The metric used to transform the input vector. If None, the default transformation metric for each target type is applied. For binary target options are: “woe” (default), “event_rate”, “indices” and “bins”. For continuous target options are: “mean” (default), “indices” and “bins”. For multiclass target options are: “mean_woe” (default), “weighted_mean_woe”, “indices” and “bins”.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
**kwargs (keyword arguments) – Keyword arguments for pandas.read_csv.

Returns

self – Fitted binning process.

Return type

BinningProcess
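The role of chunksize can be illustrated with a standard-library sketch that reads, transforms and writes a CSV file a fixed number of rows at a time. The function transform_csv_in_chunks and its transform_row callback are hypothetical stand-ins for the per-variable binning transformation; the library's own implementation may differ.

```python
import csv
from itertools import islice


def transform_csv_in_chunks(input_path, output_path, chunksize, transform_row):
    """Read, transform and write a CSV file chunksize rows at a time."""
    n_chunks = 0
    with open(input_path, newline="") as src, \
            open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        while True:
            # Only chunksize rows are held in memory at any time.
            chunk = list(islice(reader, chunksize))
            if not chunk:
                break
            writer.writerows(transform_row(row) for row in chunk)
            n_chunks += 1
    return n_chunks
```

A smaller chunksize lowers peak memory use at the cost of more read and write passes.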

get_binned_variable(name)

Return optimal binning object for a given variable name.

Parameters: name (string) – The variable name.

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

get_support(indices=False, names=False)

Get a mask, or integer index, or names of the variables selected.

Parameters

indices (boolean (default=False)) – If True, the return value will be an array of integers, rather than a boolean mask.
names (boolean (default=False)) – If True, the return value will be an array of strings, rather than a boolean mask.

Returns

support – An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector. If names is True, this is a string array of shape [# output features] whose values are the names of the selected features.

Return type

array
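The three return forms carry the same information in different shapes. The following plain-Python sketch builds each of them from a made-up selection; the variable names are illustrative, and the real method returns arrays rather than lists.

```python
variable_names = ["age", "income", "tenure", "region"]
selected = {"age", "tenure"}  # made-up selection result

# Boolean mask: one entry per input variable.
mask = [name in selected for name in variable_names]

# Integer indices: positions of the selected variables in the input.
indices = [i for i, keep in enumerate(mask) if keep]

# Names: the selected variable names, in input order.
names = [name for name, keep in zip(variable_names, mask) if keep]
```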

information(print_level=1)

Print overview information about the options settings and statistics.

Parameters: print_level (int (default=1)) – Level of details.

classmethod load(path)

Load binning process from pickle file.

Parameters: path (str) – Pickle file path.

Example

>>> from optbinning import BinningProcess
>>> binning_process = BinningProcess.load("my_binning_process.pkl")

save(path)

Save binning process to pickle file.

Parameters: path (str) – Pickle file path.
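A pickle round trip can be sketched with the standard library alone. The functions save_object and load_object and the state dictionary are hypothetical; the sketch assumes only that saving and loading amount to writing and reading a pickle file, as described above.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize an object to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Restore an object previously written by save_object()."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Made-up fitted state standing in for a binning process.
state = {"variable_names": ["age", "income"], "splits": {"age": [30.0, 45.0]}}

path = os.path.join(tempfile.mkdtemp(), "my_binning_process.pkl")
save_object(state, path)
restored = load_object(path)
```

Pickle files can execute arbitrary code when loaded, so they should only be loaded from trusted sources.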

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance
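The <component>__<parameter> naming can be illustrated by splitting a key at the first double underscore. The helper split_param_key and the component name "binning" are hypothetical and serve only to show the convention.

```python
def split_param_key(key):
    """Split 'component__parameter' into (component, parameter)."""
    component, sep, parameter = key.partition("__")
    if not sep:
        # No double underscore: the key addresses the estimator itself.
        return None, key
    return component, parameter


parsed = [split_param_key(k) for k in ("max_n_bins", "binning__max_n_bins")]
```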

summary()

Binning process summary with main statistics for all binned variables.

Returns: df_summary – Binning process summary.
Return type: pandas.DataFrame

transform(X, metric=None, metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Transform given data to metric using bins from each fitted optimal binning.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples.
metric (str or None, (default=None)) – The metric used to transform the input vector. If None, the default transformation metric for each target type is applied. For binary target options are: “woe” (default), “event_rate”, “indices” and “bins”. For continuous target options are: “mean” (default), “indices” and “bins”. For multiclass target options are: “mean_woe” (default), “weighted_mean_woe”, “indices” and “bins”.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.

Returns

X_new (numpy array or pandas.DataFrame, shape = (n_samples, n_features_new)) – Transformed array.
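For a single numerical variable, the transformation amounts to locating the bin of each value and returning that bin's metric value, with missing values and special codes handled separately. The sketch below shows the idea; the function transform_value, the split points, the WoE values and the left-closed bin layout are all assumptions made for illustration.

```python
import bisect
import math


def transform_value(x, splits, bin_values, special_codes=(), metric_special=0,
                    metric_missing=0):
    """Map one numerical value to the metric value of the bin it falls in."""
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return metric_missing
    if x in special_codes:
        return metric_special
    # Assumed bin layout: (-inf, s1), [s1, s2), ..., [sk, inf).
    index = bisect.bisect_right(splits, x)
    return bin_values[index]


splits = [10.0, 20.0]            # two split points -> three bins
woe_per_bin = [0.9, 0.1, -1.4]   # made-up WoE values, one per bin

transformed = [
    transform_value(x, splits, woe_per_bin, special_codes=(-999,))
    for x in [5.0, 10.0, 25.0, -999, float("nan")]
]
```

Passing the bin positions 0, 1, 2 as bin_values instead of WoE values would give the equivalent of an index-style output.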

transform_disk(input_path, output_path, chunksize, metric=None, metric_special=0, metric_missing=0, show_digits=2, **kwargs)

Transform given data on disk to metric using bins from each fitted optimal binning. Save to comma-separated values (csv) file.

Parameters

input_path (str) – Any valid string path to a file with extension .csv.
output_path (str) – Any valid string path to a file with extension .csv.
chunksize (int) – Rows to read, transform and write at a time.
metric (str or None, (default=None)) – The metric used to transform the input vector. If None, the default transformation metric for each target type is applied. For binary target options are: “woe” (default), “event_rate”, “indices” and “bins”. For continuous target options are: “mean” (default), “indices” and “bins”. For multiclass target options are: “mean_woe” (default), “weighted_mean_woe”, “indices” and “bins”.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
**kwargs (keyword arguments) – Keyword arguments for pandas.read_csv.

Returns

self – Fitted binning process.

Return type

BinningProcess

update_binned_variable(name, optb)

Update optimal binning object for a given variable.

Parameters

name (string) – The variable name.
optb (object) – The optimal binning object already fitted.