Binning process sketch with binary target

class optbinning.BinningProcessSketch(variable_names, max_n_prebins=20, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, max_pvalue=None, max_pvalue_policy='consecutive', selection_criteria=None, categorical_variables=None, special_codes=None, split_digits=None, binning_fit_params=None, binning_transform_params=None, verbose=False)

Bases: optbinning.binning.distributed.base.BaseSketch, sklearn.base.BaseEstimator, optbinning.binning.binning_process.BaseBinningProcess

Binning process over data streams to compute optimal binning of variables with respect to a binary target.

Parameters
  • variable_names (array-like) – List of variable names.

  • max_n_prebins (int (default=20)) – The maximum number of bins after pre-binning (prebins).

  • min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].

  • max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].

  • min_bin_size (float or None, optional (default=None)) – The minimum number of records for each bin, expressed as a fraction of the total. If None, min_bin_size = min_prebin_size.

  • max_bin_size (float or None, optional (default=None)) – The maximum number of records for each bin, expressed as a fraction of the total. If None, max_bin_size = 1.0.

  • max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins.

  • max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.

  • selection_criteria (dict or None (default=None)) – Variable selection criteria. See notes.

  • special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.

  • split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.

  • categorical_variables (array-like or None, optional (default=None)) – List of numerical variables to be considered categorical. These are nominal variables. Not applicable when target type is multiclass.

  • binning_fit_params (dict or None, optional (default=None)) – Dictionary with optimal binning fitting options for specific variables. Example: {"variable_1": {"max_n_bins": 4}}.

  • binning_transform_params (dict or None, optional (default=None)) – Dictionary with optimal binning transform options for specific variables. Example: {"variable_1": {"metric": "event_rate"}}.

  • verbose (bool (default=False)) – Enable verbose output.

Notes

The selection_criteria parameter specifies the criteria for variable selection. The input is a dictionary of the following form:

selection_criteria = {
    "metric_1":
        {
            "min": 0, "max": 1, "strategy": "highest", "top": 0.25
        },
    "metric_2":
        {
            "min": 0.02
        }
}

where several metrics can be combined. For example, the dictionary above indicates that the top 25% of variables with “metric_1” in [0, 1] and “metric_2” greater than or equal to 0.02 are selected. Supported key values are:

  • keys min and max support numerical values.

  • key strategy supports options “highest” and “lowest”.

  • key top supports an integer or decimal (percentage).
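The semantics described above can be sketched in plain Python. This is one reading of the rules, not the library's implementation: the helper select_variables, the score values and the variable names are hypothetical, criteria are assumed to be applied in dictionary order, and rounding a fractional top upward is an assumption.

```python
import math


def select_variables(scores, selection_criteria):
    """Return the variable names kept after applying every criterion in turn.

    scores maps metric name -> {variable name: metric value}.
    """
    first_metric = next(iter(scores))
    selected = set(scores[first_metric])

    for metric, rule in selection_criteria.items():
        values = scores[metric]
        lower = rule.get("min", -math.inf)
        upper = rule.get("max", math.inf)

        # Keep only variables whose metric value lies in [min, max].
        passing = [v for v in selected if lower <= values[v] <= upper]

        if "top" in rule:
            highest = rule.get("strategy", "highest") == "highest"
            # Sort by metric value; the name is a tie-breaker for determinism.
            passing.sort(key=lambda v: (values[v], v), reverse=highest)
            top = rule["top"]
            # Integer: number of variables. Decimal: fraction of the variables
            # that passed the bounds (rounding up is an assumption).
            n_keep = top if isinstance(top, int) else math.ceil(top * len(passing))
            passing = passing[:n_keep]

        selected = set(passing)

    return sorted(selected)


# Hypothetical metric values for four variables.
scores = {
    "metric_1": {"age": 0.9, "income": 0.5, "tenure": 0.3, "region": 0.1},
    "metric_2": {"age": 0.05, "income": 0.01, "tenure": 0.03, "region": 0.2},
}
selection_criteria = {
    "metric_1": {"min": 0, "max": 1, "strategy": "highest", "top": 0.25},
    "metric_2": {"min": 0.02},
}
print(select_variables(scores, selection_criteria))  # ['age']
```

Here all four variables satisfy the bounds on metric_1, the top 25% by highest value is a single variable, and that variable also satisfies the lower bound on metric_2.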

Warning

If the binning process instance is going to be saved, do not pass the option "solver": "mip" via the binning_fit_params parameter.

add(X, y, check_input=False)

Add new data X, y to the binning sketch of each variable.

Parameters
  • X (pandas.DataFrame, shape (n_samples, n_features)) – New data to add to the binning sketches.

  • y (array-like of shape (n_samples,)) – Target vector relative to X.

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

self – Binning process with new data.

Return type

BinningProcessSketch

get_binned_variable(name)

Return optimal binning sketch object for a given variable name.

Parameters

name (string) – The variable name.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

get_support(indices=False, names=False)

Get a mask, or integer index, or names of the variables selected.

Parameters
  • indices (boolean (default=False)) – If True, the return value will be an array of integers, rather than a boolean mask.

  • names (boolean (default=False)) – If True, the return value will be an array of strings, rather than a boolean mask.

Returns

support – An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector. If names is True, this is a string array of shape [# output features], whose values are the names of the selected features.

Return type

array
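The relationship between the three return forms can be illustrated with NumPy alone. The snippet below does not call get_support; the variable names and the mask are hypothetical stand-ins for what a solved binning process might report.

```python
import numpy as np

# Hypothetical input variables and selection mask.
variable_names = np.array(["age", "income", "region_code", "tenure"])
mask = np.array([True, False, True, True])   # shape: [# input features]

# Integer positions of the selected variables, shape: [# output features].
indices = np.flatnonzero(mask)

# Names of the selected variables, shape: [# output features].
names = variable_names[mask]

print(indices.tolist())  # [0, 2, 3]
print(names.tolist())    # ['age', 'region_code', 'tenure']
```

The mask always has one entry per input variable, whereas the index and name arrays have one entry per selected variable.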

information(print_level=1)

Print overview information about the options settings and statistics.

Parameters

print_level (int (default=1)) – Level of details.

classmethod load(path)

Load binning process from pickle file.

Parameters

path (str) – Pickle file path.

Example

>>> from optbinning import BinningProcess
>>> binning_process = BinningProcess.load("my_binning_process.pkl")

merge(bpsketch)

Merge current instance with another BinningProcessSketch instance.

Parameters

bpsketch (object) – BinningProcessSketch instance.

mergeable(bpsketch)

Check whether two BinningProcessSketch instances can be merged.

Parameters

bpsketch (object) – BinningProcessSketch instance.

Returns

mergeable

Return type

bool

save(path)

Save binning process to pickle file.

Parameters

path (str) – Pickle file path.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

solve()

Solve optimal binning for all variables using added data.

Returns

self – Current fitted binning process.

Return type

BinningProcessSketch

summary()

Binning process summary with main statistics for all binned variables.

Returns

df_summary – Binning process summary.

Return type

pandas.DataFrame

transform(X, metric='woe', metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Transform given data to metric using bins from each fitted optimal binning.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples.

  • metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence, “event_rate” to choose the event rate, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.

  • metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.

  • metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.

  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

X_new – Transformed array.

Return type

pandas.DataFrame, shape = (n_samples, n_features_new)