Binning process sketch with binary target

class optbinning.BinningProcessSketch(variable_names, max_n_prebins=20, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, max_pvalue=None, max_pvalue_policy='consecutive', selection_criteria=None, categorical_variables=None, special_codes=None, split_digits=None, binning_fit_params=None, binning_transform_params=None, verbose=False)

Bases: optbinning.binning.distributed.base.BaseSketch, sklearn.base.BaseEstimator, optbinning.binning.binning_process.BaseBinningProcess

Binning process over data streams to compute optimal binning of variables with respect to a binary target.

Parameters

variable_names (array-like) – List of variable names.
max_n_prebins (int (default=20)) – The maximum number of bins after pre-binning (prebins).
min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].
max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].
min_bin_size (float or None, optional (default=None)) – The minimum fraction of records in each bin. If None, min_bin_size = min_prebin_size.
max_bin_size (float or None, optional (default=None)) – The maximum fraction of records in each bin. If None, max_bin_size = 1.0.
max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins.
max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.
selection_criteria (dict or None (default=None)) – Variable selection criteria. See notes.
special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.
categorical_variables (array-like or None, optional (default=None)) – List of numerical variables to be treated as categorical. These are nominal variables. Not applicable when target type is multiclass.
binning_fit_params (dict or None, optional (default=None)) – Dictionary with optimal binning fitting options for specific variables. Example: {"variable_1": {"max_n_bins": 4}}.
binning_transform_params (dict or None, optional (default=None)) – Dictionary with optimal binning transform options for specific variables. Example: {"variable_1": {"metric": "event_rate"}}.
verbose (bool (default=False)) – Enable verbose output.

Notes

The selection_criteria parameter specifies the criteria for variable selection. The input is a dictionary of the following form:

selection_criteria = {
    "metric_1":
        {
            "min": 0, "max": 1, "strategy": "highest", "top": 0.25
        },
    "metric_2":
        {
            "min": 0.02
        }
}

where several metrics can be combined. For example, the dictionary above selects the top 25% of variables with “metric_1” in [0, 1], restricted to those with “metric_2” greater than or equal to 0.02. Supported key values are:

keys min and max support numerical values.
key strategy supports options “highest” and “lowest”.
key top supports an integer or decimal (percentage).
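To make the combination rule concrete, here is a minimal pure-Python sketch of one way to read the description above. It is not the library's implementation: select_variables, the variable names, and the metric values are all hypothetical, the min and max bounds are assumed to be inclusive, and a decimal top is assumed to be a fraction of the variables that pass the bounds, rounded up.

```python
import math


def select_variables(variables, metric_values, selection_criteria):
    """Illustrative reading of selection_criteria (hypothetical helper).

    metric_values maps each metric name to a {variable: value} dictionary.
    """
    selected = set(variables)
    for metric, rule in selection_criteria.items():
        values = metric_values[metric]
        lower = rule.get("min", -math.inf)
        upper = rule.get("max", math.inf)
        # Keep variables whose metric lies within the (assumed inclusive) bounds.
        valid = [v for v in variables if lower <= values[v] <= upper]
        if "strategy" in rule and "top" in rule:
            top = rule["top"]
            if isinstance(top, float):
                # Assumption: a decimal is a fraction of the valid variables.
                top = math.ceil(len(valid) * top)
            highest_first = rule["strategy"] == "highest"
            valid = sorted(valid, key=values.get, reverse=highest_first)[:top]
        # A variable must satisfy every metric's rule to be selected.
        selected &= set(valid)
    return [v for v in variables if v in selected]


variables = ["var_a", "var_b", "var_c", "var_d"]
metric_values = {
    "metric_1": {"var_a": 0.9, "var_b": 0.5, "var_c": 0.2, "var_d": 1.5},
    "metric_2": {"var_a": 0.05, "var_b": 0.01, "var_c": 0.3, "var_d": 0.4},
}
selection_criteria = {
    "metric_1": {"min": 0, "max": 1, "strategy": "highest", "top": 0.25},
    "metric_2": {"min": 0.02},
}
selected = select_variables(variables, metric_values, selection_criteria)
print(selected)  # ['var_a']
```

Here var_d fails the “metric_1” upper bound, the top 25% of the three remaining variables is a single variable (var_a), and var_a also satisfies the “metric_2” minimum.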

Warning

If the binning process instance is going to be saved, do not pass the option "solver": "mip" via the binning_fit_params parameter.
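One way to guard against this before saving is to scan binning_fit_params for the option. The helper below is a hypothetical sketch (variables_using_mip is not part of optbinning); it only inspects the dictionary layout shown in the parameter description.

```python
def variables_using_mip(binning_fit_params):
    """Return the variables whose fit options request the "mip" solver.

    Hypothetical helper for illustration, not an optbinning function.
    """
    if not binning_fit_params:
        return []
    return [
        name
        for name, options in binning_fit_params.items()
        if options.get("solver") == "mip"
    ]


safe_params = {"variable_1": {"max_n_bins": 4}}
risky_params = {
    "variable_1": {"max_n_bins": 4},
    "variable_2": {"solver": "mip"},
}

print(variables_using_mip(safe_params))   # []
print(variables_using_mip(risky_params))  # ['variable_2']
```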

add(X, y, check_input=False)

Add new data X, y to the binning sketch of each variable.

Parameters

X (pandas.DataFrame, shape (n_samples, n_features)) –
y (array-like of shape (n_samples,)) – Target vector relative to X.
check_input (bool (default=False)) – Whether to check input arrays.

Returns

self – Binning process with new data.

Return type

BinningProcessSketch
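The streaming workflow implied by add and solve can be sketched as a small loop: feed each batch to the sketch, then solve once at the end. The function below is a hypothetical helper written against the documented method names only; it accepts any object that exposes add(X, y) and solve().

```python
def fit_from_batches(sketch, batches):
    """Add every (X, y) batch to the sketch, then solve once.

    Hypothetical helper: `sketch` is any object with add(X, y) and solve(),
    such as a BinningProcessSketch; `batches` is an iterable of (X, y) pairs.
    """
    for X, y in batches:
        sketch.add(X, y)
    # solve() is documented to return the fitted binning process.
    return sketch.solve()
```

With optbinning installed, sketch would be a BinningProcessSketch built with the variable names, and the batches could come from any chunked source, for example pandas.read_csv with the chunksize argument, splitting the target column off each chunk.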

get_binned_variable(name)

Return optimal binning sketch object for a given variable name.

Parameters: name (string) – The variable name.

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

get_support(indices=False, names=False)

Get a mask, or integer index, or names of the variables selected.

Parameters

indices (boolean (default=False)) – If True, the return value will be an array of integers, rather than a boolean mask.
names (boolean (default=False)) – If True, the return value will be an array of strings, rather than a boolean mask.

Returns

support – An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector. If names is True, this is a string array of shape [# output features], whose values are the names of the selected features.

Return type

array
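The three return forms describe the same selection. The pure-Python sketch below shows how they relate, using plain lists instead of arrays; support_views and the variable names are hypothetical and are not part of optbinning.

```python
def support_views(variable_names, mask):
    """Derive the integer-index and name forms from a boolean mask.

    Hypothetical illustration of the three forms described above.
    """
    indices = [i for i, keep in enumerate(mask) if keep]
    names = [variable_names[i] for i in indices]
    return mask, indices, names


variable_names = ["age", "income", "tenure"]
mask = [True, False, True]  # one entry per input feature

_, indices, names = support_views(variable_names, mask)
print(indices)  # [0, 2]
print(names)    # ['age', 'tenure']
```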

information(print_level=1)

Print overview information about the options settings and statistics.

Parameters: print_level (int (default=1)) – Level of details.

classmethod load(path)

Load binning process from pickle file.

Parameters: path (str) – Pickle file path.

Example

>>> from optbinning import BinningProcessSketch
>>> binning_process = BinningProcessSketch.load("my_binning_process.pkl")

merge(bpsketch)

Merge current instance with another BinningProcessSketch instance.

Parameters: bpsketch (object) – BinningProcessSketch instance.

mergeable(bpsketch)

Check whether two BinningProcessSketch instances can be merged.

Parameters: bpsketch (object) – BinningProcessSketch instance.
Returns: mergeable
Return type: bool
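A typical use of merge and mergeable is to combine sketches that were built on separate data partitions. The helper below is a hypothetical sketch relying only on the two documented methods; raising ValueError for incompatible sketches is an assumption of this example, not documented library behavior.

```python
def merge_sketches(base, others):
    """Merge each sketch in `others` into `base`, checking compatibility first.

    Hypothetical helper: all arguments are objects exposing mergeable() and
    merge(), such as BinningProcessSketch instances.
    """
    for position, other in enumerate(others):
        if not base.mergeable(other):
            raise ValueError(f"sketch at position {position} cannot be merged")
        base.merge(other)
    return base
```

After merging, solve() can be called on the combined sketch in the same way as on a sketch fed directly with add.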

save(path)

Save binning process to pickle file.

Parameters: path (str) – Pickle file path.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

solve()

Solve optimal binning for all variables using added data.

Returns: self – Current fitted binning process.
Return type: BinningProcessSketch

summary()

Binning process summary with main statistics for all binned variables.

Returns: df_summary – Binning process summary.
Return type: pandas.DataFrame

transform(X, metric='woe', metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Transform given data to metric using bins from each fitted optimal binning.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples.
metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence, “event_rate” to choose the event rate, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.

Returns

X_new – Transformed array.

Return type

pandas.DataFrame, shape = (n_samples, n_features_new)
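To illustrate what metric, metric_special and metric_missing mean for a single numerical variable, the pure-Python sketch below maps values to a per-bin metric. It is a simplified illustration, not the library's code: transform_values, the split points, and the per-bin values are hypothetical, bins are assumed to be closed on the left and open on the right, and only numerical settings for metric_special and metric_missing are covered (not “empirical”).

```python
import math
from bisect import bisect_right


def transform_values(values, splits, bin_metric, special_codes=(),
                     metric_special=0, metric_missing=0):
    """Map each value to the metric of its bin (hypothetical illustration).

    splits is a sorted list of split points; bin_metric holds one metric
    value (for example a WoE) per bin, so len(bin_metric) == len(splits) + 1.
    """
    transformed = []
    for value in values:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            # Missing values receive metric_missing.
            transformed.append(metric_missing)
        elif value in special_codes:
            # Special codes are treated separately and receive metric_special.
            transformed.append(metric_special)
        else:
            # Assumption: bin i covers [splits[i - 1], splits[i]).
            transformed.append(bin_metric[bisect_right(splits, value)])
    return transformed


splits = [10, 20]
bin_metric = [0.5, -0.1, -0.8]  # made-up per-bin values
values = [5, 10, 25, None, -999]

result = transform_values(values, splits, bin_metric, special_codes=[-999])
print(result)  # [0.5, -0.1, -0.8, 0, 0]
```

Setting metric="indices" corresponds, in this simplified picture, to returning the bin position itself instead of looking up bin_metric.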