Binning process

class optbinning.BinningProcess(variable_names, max_n_prebins=20, min_prebin_size=0.05, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, max_pvalue=None, max_pvalue_policy='consecutive', selection_criteria=None, fixed_variables=None, categorical_variables=None, special_codes=None, split_digits=None, binning_fit_params=None, binning_transform_params=None, n_jobs=None, verbose=False)

Bases: optbinning.binning.base.Base, sklearn.base.BaseEstimator, optbinning.binning.binning_process.BaseBinningProcess

Binning process to compute optimal binning of variables in a dataset, given a binary, continuous or multiclass target dtype.

Parameters
  • variable_names (array-like) – List of variable names.

  • max_n_prebins (int (default=20)) – The maximum number of bins after pre-binning (prebins).

  • min_prebin_size (float (default=0.05)) – The fraction of minimum number of records for each prebin.

  • min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].

  • max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].

  • min_bin_size (float or None, optional (default=None)) – The fraction of minimum number of records for each bin. If None, min_bin_size = min_prebin_size.

  • max_bin_size (float or None, optional (default=None)) – The fraction of maximum number of records for each bin. If None, max_bin_size = 1.0.

  • max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins.

  • max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.

  • selection_criteria (dict or None (default=None)) –

    Variable selection criteria. See notes.

    New in version 0.6.0.

  • fixed_variables (array-like or None) –

    List of variables to be fixed. The binning process will retain these variables even if the selection criteria are not satisfied.

    New in version 0.12.1.

  • special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.

  • split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.

  • categorical_variables (array-like or None, optional (default=None)) – List of numerical variables to be considered categorical. These are nominal variables. Not applicable when target type is multiclass.

  • binning_fit_params (dict or None, optional (default=None)) – Dictionary with optimal binning fitting options for specific variables. Example: {"variable_1": {"max_n_bins": 4}}.

  • binning_transform_params (dict or None, optional (default=None)) – Dictionary with optimal binning transform options for specific variables. Example: {"variable_1": {"metric": "event_rate"}}.

  • n_jobs (int or None, optional (default=None)) –

    Number of cores to run in parallel while binning variables. None means 1 core. -1 means using all processors.

    New in version 0.7.1.

  • verbose (bool (default=False)) – Enable verbose output.

Notes

The selection_criteria parameter specifies the criteria for variable selection. The input is a dictionary of the following form:

selection_criteria = {
    "metric_1":
        {
            "min": 0, "max": 1, "strategy": "highest", "top": 0.25
        },
    "metric_2":
        {
            "min": 0.02
        }
}

where several metrics can be combined. For example, the dictionary above indicates that the top 25% of variables with “metric_1” in [0, 1] and “metric_2” greater than or equal to 0.02 are selected. Supported key values are:

  • keys min and max support numerical values.

  • key strategy supports options “highest” and “lowest”.

  • key top supports an integer or decimal (percentage).
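The rules above can be pictured with a small pure-Python sketch. It is only one possible reading of the description, not optbinning's implementation: the helper select_variables, the variable names and the metric values are all hypothetical, and the sketch assumes that criteria are applied in dictionary order and that a decimal top is a fraction of the variables still remaining.

```python
import math


def select_variables(metric_values, selection_criteria):
    """Return the variable names kept by the criteria (illustrative only).

    metric_values maps variable name -> {metric name -> value}.
    """
    selected = list(metric_values)
    for metric, rule in selection_criteria.items():
        # Range filter: keep variables whose metric lies in [min, max].
        lower = rule.get("min", -math.inf)
        upper = rule.get("max", math.inf)
        selected = [
            name for name in selected
            if lower <= metric_values[name][metric] <= upper
        ]

        # Optional ranking: keep only the top variables for this metric.
        if "strategy" in rule and "top" in rule:
            top = rule["top"]
            if not isinstance(top, int):
                # Assumption: a decimal is a fraction of the remaining variables.
                top = math.ceil(len(selected) * top)
            ranked = sorted(
                selected,
                key=lambda name: metric_values[name][metric],
                reverse=(rule["strategy"] == "highest"),
            )
            keep = set(ranked[:top])
            selected = [name for name in selected if name in keep]
    return selected


# Hypothetical metric values for five variables.
metric_values = {
    "a": {"metric_1": 0.9, "metric_2": 0.10},
    "b": {"metric_1": 0.5, "metric_2": 0.05},
    "c": {"metric_1": 1.5, "metric_2": 0.30},
    "d": {"metric_1": 0.2, "metric_2": 0.01},
    "e": {"metric_1": 0.7, "metric_2": 0.03},
}

selection_criteria = {
    "metric_1": {"min": 0, "max": 1, "strategy": "highest", "top": 0.25},
    "metric_2": {"min": 0.02},
}

selected = select_variables(metric_values, selection_criteria)
```

Here "c" fails the [0, 1] range for metric_1, the top 25% of the four remaining variables is the single variable with the highest metric_1, and that variable also satisfies the metric_2 minimum.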

Warning

If the binning process instance is going to be saved, do not pass the option "solver": "mip" via the binning_fit_params parameter.

fit(X, y, sample_weight=None, check_input=False)

Fit the binning process. Fit the optimal binning to all variables according to the given training data.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) –

    Training vector, where n_samples is the number of samples.

    Changed in version 0.4.0.

    X supports numpy.ndarray and pandas.DataFrame.

  • y (array-like of shape (n_samples,)) – Target vector relative to X.

  • sample_weight (array-like of shape (n_samples,) (default=None)) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Only applied if prebinning_method="cart". This option is only available for a binary target.

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

self – Fitted binning process.

Return type

BinningProcess

fit_disk(input_path, target, **kwargs)

Fit the binning process according to the given training data on disk.

Parameters
  • input_path (str) – Any valid string path to a file with extension .csv or .parquet.

  • target (str) – Target column.

  • **kwargs (keyword arguments) – Keyword arguments for pandas.read_csv or pandas.read_parquet.

Returns

self – Fitted binning process.

Return type

BinningProcess

fit_from_dict(dict_optb)

Fit the binning process from a dict of OptimalBinning objects already fitted.

Parameters

dict_optb (dict) – Dictionary with OptimalBinning objects for binary, continuous or multiclass target. All objects must share the same class.

Returns

self – Fitted binning process.

Return type

BinningProcess

fit_transform(X, y, sample_weight=None, metric=None, metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Fit the binning process according to the given training data, then transform it.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples.

  • y (array-like of shape (n_samples,)) – Target vector relative to X.

  • sample_weight (array-like of shape (n_samples,) (default=None)) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Only applied if prebinning_method="cart". This option is only available for a binary target.

  • metric (str or None, (default=None)) – The metric used to transform the input vector. If None, the default transformation metric for each target type is applied. For binary target options are: “woe” (default), “event_rate”, “indices” and “bins”. For continuous target options are: “mean” (default), “indices” and “bins”. For multiclass target options are: “mean_woe” (default), “weighted_mean_woe”, “indices” and “bins”.

  • metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.

  • metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.

  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

X_new – Transformed array.

Return type

numpy array, shape = (n_samples, n_features_new)

fit_transform_disk(input_path, output_path, target, chunksize, metric=None, metric_special=0, metric_missing=0, show_digits=2, **kwargs)

Fit the binning process according to the given training data on disk, then transform it and save to comma-separated values (csv) file.

Parameters
  • input_path (str) – Any valid string path to a file with extension .csv.

  • output_path (str) – Any valid string path to a file with extension .csv.

  • target (str) – Target column.

  • chunksize (int) – Rows to read, transform and write at a time.

  • metric (str or None, (default=None)) – The metric used to transform the input vector. If None, the default transformation metric for each target type is applied. For binary target options are: “woe” (default), “event_rate”, “indices” and “bins”. For continuous target options are: “mean” (default), “indices” and “bins”. For multiclass target options are: “mean_woe” (default), “weighted_mean_woe”, “indices” and “bins”.

  • metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.

  • metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.

  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".

  • **kwargs (keyword arguments) – Keyword arguments for pandas.read_csv.

Returns

self – Fitted binning process.

Return type

BinningProcess

get_binned_variable(name)

Return optimal binning object for a given variable name.

Parameters

name (string) – The variable name.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

get_support(indices=False, names=False)

Get a mask, or integer index, or names of the variables selected.

Parameters
  • indices (boolean (default=False)) – If True, the return value will be an array of integers, rather than a boolean mask.

  • names (boolean (default=False)) – If True, the return value will be an array of strings, rather than a boolean mask.

Returns

support – An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector. If names is True, this is a string array of shape [# output features], whose values are names of the selected features.

Return type

array

information(print_level=1)

Print overview information about the options settings and statistics.

Parameters

print_level (int (default=1)) – Level of details.

classmethod load(path)

Load binning process from pickle file.

Parameters

path (str) – Pickle file path.

Example

>>> from optbinning import BinningProcess
>>> binning_process = BinningProcess.load("my_binning_process.pkl")

save(path)

Save binning process to pickle file.

Parameters

path (str) – Pickle file path.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

summary()

Binning process summary with main statistics for all binned variables.

Returns

df_summary – Binning process summary.

Return type

pandas.DataFrame

transform(X, metric=None, metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Transform given data to metric using bins from each fitted optimal binning.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples.

  • metric (str or None, (default=None)) – The metric used to transform the input vector. If None, the default transformation metric for each target type is applied. For binary target options are: “woe” (default), “event_rate”, “indices” and “bins”. For continuous target options are: “mean” (default), “indices” and “bins”. For multiclass target options are: “mean_woe” (default), “weighted_mean_woe”, “indices” and “bins”.

  • metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.

  • metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.

  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

X_new – Transformed array.

Return type

numpy array or pandas.DataFrame, shape = (n_samples, n_features_new)

transform_disk(input_path, output_path, chunksize, metric=None, metric_special=0, metric_missing=0, show_digits=2, **kwargs)

Transform given data on disk to metric using bins from each fitted optimal binning. Save to comma-separated values (csv) file.

Parameters
  • input_path (str) – Any valid string path to a file with extension .csv.

  • output_path (str) – Any valid string path to a file with extension .csv.

  • chunksize (int) – Rows to read, transform and write at a time.

  • metric (str or None, (default=None)) – The metric used to transform the input vector. If None, the default transformation metric for each target type is applied. For binary target options are: “woe” (default), “event_rate”, “indices” and “bins”. For continuous target options are: “mean” (default), “indices” and “bins”. For multiclass target options are: “mean_woe” (default), “weighted_mean_woe”, “indices” and “bins”.

  • metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.

  • metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.

  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".

  • **kwargs (keyword arguments) – Keyword arguments for pandas.read_csv.

Returns

self – Fitted binning process.

Return type

BinningProcess
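The read, transform and write cycle that chunksize controls can be pictured with plain pandas. This is a sketch of the general pattern only, not optbinning's implementation: the helper transform_csv_in_chunks and the doubling transformation are hypothetical stand-ins for the fitted binning transformation.

```python
import os
import tempfile

import pandas as pd


def transform_csv_in_chunks(input_path, output_path, chunksize, transform):
    """Read a CSV in chunks, apply `transform`, and write the result to CSV."""
    reader = pd.read_csv(input_path, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        out = transform(chunk)
        # Write the header with the first chunk, then append the rest.
        out.to_csv(
            output_path,
            mode="w" if i == 0 else "a",
            header=(i == 0),
            index=False,
        )


# Tiny demonstration with a stand-in transformation (doubling a column).
tmp_dir = tempfile.mkdtemp()
input_path = os.path.join(tmp_dir, "input.csv")
output_path = os.path.join(tmp_dir, "output.csv")
pd.DataFrame({"x": range(10)}).to_csv(input_path, index=False)

transform_csv_in_chunks(
    input_path, output_path, chunksize=4, transform=lambda df: df * 2
)
result = pd.read_csv(output_path)
```

Only chunksize rows are held in memory at a time, which is the reason for processing on disk when the data does not fit in memory.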

update_binned_variable(name, optb)

Update optimal binning object for a given variable.

Parameters
  • name (string) – The variable name.

  • optb (object) – The optimal binning object already fitted.