Stochastic optimal binning

Introduction

The data used when performing optimal binning is generally assumed to be known accurately and to be fully representative of past, present, and future data. This confidence can produce misleading results, especially with data representing future events such as product demand, churn rate, or probability of default.

Stochastic programming is a framework for explicitly incorporating uncertainty. It uses random variables to account for data variability and optimizes the expected value of the objective function. OptBinning implements the stochastic programming approach using the two-stage scenario-based formulation (also known as the extensive form or deterministic equivalent), obtaining a deterministic mixed-integer linear programming formulation. The scenario-based formulation enforces the nonanticipativity constraints and yields a solution feasible for every scenario, leading to a more robust result.

Scenario-based optimal binning

class optbinning.binning.uncertainty.SBOptimalBinning(name='', prebinning_method='cart', max_n_prebins=20, min_prebin_size=0.05, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, monotonic_trend=None, min_event_rate_diff=0, max_pvalue=None, max_pvalue_policy='consecutive', class_weight=None, user_splits=None, user_splits_fixed=None, special_codes=None, split_digits=None, time_limit=100, verbose=False)

Bases: optbinning.binning.binning.OptimalBinning

Scenario-based stochastic optimal binning of a numerical variable with respect to a binary target.

Extensive form of the stochastic optimal binning given a finite number of scenarios. The goal is to maximize the expected IV while obtaining a solution feasible for all scenarios.

Parameters
  • name (str, optional (default="")) – The variable name.

  • prebinning_method (str, optional (default="cart")) – The pre-binning method. Supported methods are “cart” for a CART decision tree, “quantile” to generate prebins with approximately the same frequency, and “uniform” to generate prebins with equal width. Method “cart” uses sklearn.tree.DecisionTreeClassifier.

  • max_n_prebins (int (default=20)) – The maximum number of bins after pre-binning (prebins).

  • min_prebin_size (float (default=0.05)) – The fraction of the minimum number of records for each prebin.

  • min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].

  • max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].

  • min_bin_size (float or None, optional (default=None)) – The fraction of minimum number of records for each bin. If None, min_bin_size = min_prebin_size.

  • max_bin_size (float or None, optional (default=None)) – The fraction of maximum number of records for each bin. If None, max_bin_size = 1.0.

  • monotonic_trend (str or None, optional (default=None)) – The event rate monotonic trend. Supported trends are “ascending”, “descending”, “concave”, “convex”, “peak” and “valley”. If None, then the monotonic constraint is disabled.

  • min_event_rate_diff (float, optional (default=0)) – The minimum event rate difference between consecutive bins.

  • max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins. The Z-test is used to detect bins not satisfying the p-value constraint.

  • max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.

  • class_weight (dict, "balanced" or None, optional (default=None)) –

    Weights associated with classes in the form {class_label: weight}. If None, all classes are supposed to have weight one. Check sklearn.tree.DecisionTreeClassifier.

  • user_splits (array-like or None, optional (default=None)) – The list of pre-binning split points when dtype is “numerical” or the list of prebins when dtype is “categorical”.

  • user_splits_fixed (array-like or None (default=None)) – The list of pre-binning split points that must be fixed.

  • special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.

  • split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.

  • time_limit (int (default=100)) – The maximum time in seconds to run the optimization solver.

  • verbose (bool (default=False)) – Enable verbose output.

property binning_table

Return an instantiated binning table. Please refer to Binning table: binary target.

Returns

binning_table

Return type

BinningTable

binning_table_scenario(scenario_id)

Return the instantiated binning table corresponding to scenario_id. Please refer to Binning table: binary target.

Parameters

scenario_id (int) – Scenario identifier.

Returns

binning_table

Return type

BinningTable

fit(X, Y, weights=None, check_input=False)

Fit the optimal binning given a list of scenarios.

Parameters
  • X (array-like, shape = (n_scenarios,)) – List of training vectors, where n_scenarios is the number of scenarios.

  • Y (array-like, shape = (n_scenarios,)) – List of target vectors relative to X.

  • weights (array-like, shape = (n_scenarios,)) – Scenario weights. If None, then scenarios are equally weighted.

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

self – Fitted optimal binning.

Return type

SBOptimalBinning
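A minimal sketch of the input layout fit expects, assuming three synthetic scenarios (the variable names and data are illustrative; SBOptimalBinning itself is only shown in a comment, so the snippet runs with numpy alone):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each scenario is a (training vector, binary target) pair; scenarios may
# have different lengths. X and Y are plain Python lists of length n_scenarios.
n_scenarios = 3
X = [rng.normal(loc=s, scale=1.0, size=100 + 10 * s) for s in range(n_scenarios)]
Y = [rng.integers(0, 2, size=len(x)) for x in X]

# If weights is None, scenarios are weighted equally; an explicit weight
# vector must have one entry per scenario.
weights = np.full(n_scenarios, 1.0 / n_scenarios)

# With optbinning installed, fitting would look like:
# from optbinning.binning.uncertainty import SBOptimalBinning
# optb = SBOptimalBinning(monotonic_trend="ascending")
# optb.fit(X, Y, weights)
```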

fit_transform(x, X, Y, weights=None, metric='woe', metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Fit the optimal binning given a list of scenarios, then transform it.

Parameters
  • x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.

  • X (array-like, shape = (n_scenarios,)) – List of training vectors, where n_scenarios is the number of scenarios.

  • Y (array-like, shape = (n_scenarios,)) – List of target vectors relative to X.

  • weights (array-like, shape = (n_scenarios,)) – Scenario weights. If None, then scenarios are equally weighted.

  • metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence, “event_rate” to choose the event rate, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.

  • metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate, and any numerical value.

  • metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.

  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

x_new – Transformed array.

Return type

numpy array, shape = (n_samples,)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

information(print_level=1)

Print overview information about the option settings, problem statistics, and the solution of the computation.

Parameters

print_level (int (default=1)) – Level of details.

read_json(path)

Read a JSON file containing split points and set them as the new split points.

Parameters

path (str) – The path of the JSON file.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

property splits

List of optimal split points.

Returns

splits

Return type

numpy.ndarray

property status

The status of the underlying optimization solver.

Returns

status

Return type

str

to_json(path)

Save optimal bins and/or split points and transformation data, depending on the target type.

Parameters

path (str) – The path where the JSON file will be saved.

transform(x, metric='woe', metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Transform given data to Weight of Evidence (WoE) or event rate using bins from the fitted optimal binning.

Parameters
  • x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.

  • metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence, “event_rate” to choose the event rate, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.

  • metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.

  • metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.

  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

x_new – Transformed array.

Return type

numpy array, shape = (n_samples,)
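To make the transform semantics concrete, here is a hedged numpy sketch of how a fitted set of split points maps samples to per-bin WoE values. The split points and WoE values below are made up, not output of this class, and only the clean-value path (no special codes, no missing values) is shown:

```python
import numpy as np

# Hypothetical optimal split points and the WoE assigned to each of the
# resulting len(splits) + 1 bins (illustrative values only).
splits = np.array([0.5, 1.5, 2.5])
woe = np.array([-0.8, -0.2, 0.3, 0.9])

x = np.array([0.1, 0.7, 1.6, 3.0])

# np.digitize returns the bin index of each sample; indexing into `woe`
# mimics the metric="woe" behaviour for clean values.
bin_indices = np.digitize(x, splits)
x_new = woe[bin_indices]
# x_new -> array([-0.8, -0.2,  0.3,  0.9])
```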

Notes

Transforming data containing categories not seen during training returns zero WoE or event rate.