Optimal binning sketch with binary target

Introduction

Optimal binning is the constrained discretization of a numerical feature into bins given a binary target, maximizing a statistic such as Jeffreys’ divergence or Gini. Binning is a data preprocessing technique commonly used in binary classification, but none of the existing binning algorithms that support constraints can handle streaming data. The class OptimalBinningSketch implements a scalable, memory-efficient and robust algorithm for performing optimal binning in a streaming setting. Algorithmic details are discussed in http://gnpalencia.org/blog/2020/binning_data_streams/.
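To make the objective concrete, the following minimal sketch computes the Information Value (Jeffreys’ divergence) of a fixed binning from per-bin non-event and event counts. The function name and the example counts are illustrative assumptions, not part of the library, and the code assumes every bin holds at least one event and one non-event.

```python
import math


def information_value(n_nonevent, n_event):
    """Jeffreys' divergence between the non-event and event distributions over bins.

    Assumes every bin has at least one event and one non-event record.
    """
    total_nonevent = sum(n_nonevent)
    total_event = sum(n_event)
    iv = 0.0
    for nonevent, event in zip(n_nonevent, n_event):
        p = nonevent / total_nonevent  # share of all non-events in this bin
        q = event / total_event        # share of all events in this bin
        iv += (p - q) * math.log(p / q)
    return iv


# Hypothetical counts for a three-bin solution ...
iv_split = information_value([400, 300, 100], [20, 60, 120])
# ... and for the same data with the first two bins merged.
iv_merged = information_value([700, 100], [80, 120])
print(iv_split, iv_merged)
```

Merging bins can only keep or lower this value, which is why the optimizer needs constraints (bin sizes, monotonicity, p-values) to trade divergence against a usable binning.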

Algorithms

OptimalBinningSketch

class optbinning.binning.distributed.OptimalBinningSketch(name='', dtype='numerical', sketch='gk', eps=0.0001, K=25, solver='cp', divergence='iv', max_n_prebins=20, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, min_bin_n_nonevent=None, max_bin_n_nonevent=None, min_bin_n_event=None, max_bin_n_event=None, monotonic_trend='auto', min_event_rate_diff=0, max_pvalue=None, max_pvalue_policy='consecutive', gamma=0, cat_cutoff=None, cat_unknown=None, cat_heuristic=False, special_codes=None, split_digits=None, mip_solver='bop', time_limit=100, verbose=False)

Bases: optbinning.binning.distributed.base.BaseSketch, sklearn.base.BaseEstimator

Optimal binning over data streams of a numerical or categorical variable with respect to a binary target.

Parameters
  • name (str, optional (default="")) – The variable name.

  • dtype (str, optional (default="numerical")) – The variable data type. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.

  • sketch (str, optional (default="gk")) – Sketch algorithm. Supported algorithms are “gk” (Greenwald-Khanna) and “t-digest” (Ted Dunning’s t-digest). The “t-digest” algorithm relies on the tdigest package.

  • eps (float, optional (default=1e-4)) – Relative error epsilon. For sketch="gk" this is the absolute precision, whereas for sketch="t-digest" it is the relative precision.

  • K (int, optional (default=25)) – Excess growth parameter K used to compute the compression threshold in t-digest.

  • solver (str, optional (default="cp")) – The optimizer to solve the optimal binning problem. Supported solvers are “mip” to choose a mixed-integer programming solver, “cp” to choose a constraint programming solver or “ls” to choose LocalSolver.

  • divergence (str, optional (default="iv")) – The divergence measure in the objective function to be maximized. Supported divergences are “iv” (Information Value or Jeffreys’ divergence), “js” (Jensen-Shannon), “hellinger” (Hellinger divergence) and “triangular” (triangular discrimination).

  • max_n_prebins (int (default=20)) – The maximum number of bins after pre-binning (prebins).

  • min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].

  • max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].

  • min_bin_size (float or None, optional (default=None)) – The minimum number of records for each bin, expressed as a fraction of the total. If None, min_bin_size = min_prebin_size.

  • max_bin_size (float or None, optional (default=None)) – The maximum number of records for each bin, expressed as a fraction of the total. If None, max_bin_size = 1.0.

  • min_bin_n_nonevent (int or None, optional (default=None)) – The minimum number of non-event records for each bin. If None, min_bin_n_nonevent = 1.

  • max_bin_n_nonevent (int or None, optional (default=None)) – The maximum number of non-event records for each bin. If None, then an unlimited number of non-event records for each bin.

  • min_bin_n_event (int or None, optional (default=None)) – The minimum number of event records for each bin. If None, min_bin_n_event = 1.

  • max_bin_n_event (int or None, optional (default=None)) – The maximum number of event records for each bin. If None, then an unlimited number of event records for each bin.

  • monotonic_trend (str or None, optional (default="auto")) – The event rate monotonic trend. Supported trends are “auto”, “auto_heuristic” and “auto_asc_desc” to automatically determine the trend maximizing IV using a machine learning classifier; “ascending”, “descending”, “concave” and “convex”; “peak” and “peak_heuristic” to allow a peak change point; and “valley” and “valley_heuristic” to allow a valley change point. Trends “auto_heuristic”, “peak_heuristic” and “valley_heuristic” use a heuristic to determine the change point, and are significantly faster for large instances (max_n_prebins > 20). Trend “auto_asc_desc” automatically selects the best monotonic trend between “ascending” and “descending”. If None, the monotonic constraint is disabled.

  • min_event_rate_diff (float, optional (default=0)) – The minimum event rate difference between consecutive bins.

  • max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins. The Z-test is used to detect bins not satisfying the p-value constraint.

  • max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.

  • gamma (float, optional (default=0)) – Regularization strength to reduce the number of dominating bins. Larger values specify stronger regularization.

  • cat_cutoff (float or None, optional (default=None)) – Generate an “others” bin containing the categories whose fraction of occurrences is below the cat_cutoff value. This option is available when dtype is “categorical”.

  • cat_heuristic (bool (default=False)) – Whether to merge categories to guarantee max_n_prebins. If True, this option is triggered when the number of categories >= max_n_prebins. It is recommended when the number of categories can grow considerably in the long run and recurrent calls to the solve method are required.

  • special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.

  • split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.

  • mip_solver (str, optional (default="bop")) – The mixed-integer programming solver. Supported solvers are “bop” to choose the Google OR-Tools binary optimizer or “cbc” to choose the COIN-OR Branch-and-Cut solver CBC.

  • time_limit (int (default=100)) – The maximum time in seconds to run the optimization solver.

  • verbose (bool (default=False)) – Enable verbose output.
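The max_pvalue and max_pvalue_policy parameters above can be illustrated with a two-proportion Z-test on the event rates of two bins. This is a sketch under stated assumptions: the two-sided test, the pooled standard error and both helper names are illustrative choices and may differ from what the library computes internally.

```python
import math
from itertools import combinations


def z_test_pvalue(n_event_a, n_a, n_event_b, n_b):
    """Two-sided p-value of a pooled two-proportion Z-test (assumed variant)."""
    rate_a = n_event_a / n_a
    rate_b = n_event_b / n_b
    pooled = (n_event_a + n_event_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        return 1.0  # both bins are all events or all non-events
    z = (rate_a - rate_b) / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def violating_pairs(bins, max_pvalue, policy="consecutive"):
    """Return index pairs of bins whose event rates are not significantly different.

    bins is a list of (n_event, n_records) tuples, one per bin.
    """
    if policy == "consecutive":
        pairs = [(i, i + 1) for i in range(len(bins) - 1)]
    elif policy == "all":
        pairs = list(combinations(range(len(bins)), 2))
    else:
        raise ValueError("policy must be 'consecutive' or 'all'")
    return [
        (i, j) for i, j in pairs
        if z_test_pvalue(*bins[i], *bins[j]) > max_pvalue
    ]


# Hypothetical bins: event rates 0.10, 0.12 and 0.30 over 100 records each.
example_bins = [(10, 100), (12, 100), (30, 100)]
print(violating_pairs(example_bins, max_pvalue=0.05))
```

In this reading, a pair reported by violating_pairs is a candidate for merging, because the difference between the two event rates is not statistically significant at the chosen level.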

Notes

The parameter sketch is ignored when dtype="categorical". The sketch parameter K is only applicable when sketch="t-digest".

Both quantile sketch algorithms produce good results, with t-digest being the more accurate. Note, however, that the t-digest implementation is significantly slower than the GK implementation, so GK is the recommended algorithm when handling partitions. In addition, GK is deterministic and therefore returns reproducible results.

add(x, y, check_input=False)

Add new data x, y to the binning sketch.

Parameters
  • x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.

  • y (array-like, shape = (n_samples,)) – Target vector relative to x.

  • check_input (bool (default=False)) – Whether to check input arrays.
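A typical streaming workflow calls add once per incoming chunk and solve only when a binning is needed. The sketch below shows that pattern; fit_from_chunks and demo_with_optbinning are hypothetical helper names, the synthetic data is made up for illustration, and the demo function is not called automatically because it requires numpy and optbinning to be installed.

```python
def fit_from_chunks(sketch, chunks):
    """Feed (x, y) chunks to a binning sketch, then solve once at the end."""
    for x, y in chunks:
        sketch.add(x, y)
    return sketch.solve()


def demo_with_optbinning(n_chunks=5, chunk_size=2000, seed=0):
    """Illustrative only: build synthetic chunks and fit an OptimalBinningSketch."""
    # Imported lazily so this module loads even without the optional packages.
    import numpy as np
    from optbinning.binning.distributed import OptimalBinningSketch

    rng = np.random.default_rng(seed)
    chunks = []
    for _ in range(n_chunks):
        x = rng.normal(size=chunk_size)
        # Synthetic binary target whose event probability increases with x.
        y = (rng.random(chunk_size) < 1.0 / (1.0 + np.exp(-x))).astype(int)
        chunks.append((x, y))

    sketch = OptimalBinningSketch(name="x", dtype="numerical",
                                  sketch="gk", eps=1e-3)
    fit_from_chunks(sketch, chunks)
    return sketch.status, sketch.splits
```

Because the sketch keeps only a summary of the data, each chunk can be discarded after add returns.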

property binning_table

Return an instantiated binning table. Please refer to Binning table: binary target.

Returns

binning_table

Return type

BinningTable.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

information(print_level=1)

Print overview information about the options settings, problem statistics, and the solution of the computation.

Parameters

print_level (int (default=1)) – Level of details.

merge(optbsketch)

Merge current instance with another OptimalBinningSketch instance.

Parameters

optbsketch (object) – OptimalBinningSketch instance.

mergeable(optbsketch)

Check whether two OptimalBinningSketch instances can be merged.

Parameters

optbsketch (object) – OptimalBinningSketch instance.

Returns

mergeable

Return type

bool
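When data is processed in partitions, one sketch can be built per partition and the results combined with mergeable and merge. The helper below is a minimal sketch of that pattern. It assumes that merge updates the calling instance in place (no return value is documented) and that mergeable reports whether two sketches share compatible settings; merge_all and ToySketch are hypothetical names, and ToySketch only counts records to stand in for a real sketch.

```python
def merge_all(sketches):
    """Fold a list of sketches into the first one, checking mergeability first."""
    if not sketches:
        raise ValueError("no sketches to merge")
    merged = sketches[0]
    for other in sketches[1:]:
        if not merged.mergeable(other):
            raise ValueError("sketches were built with incompatible settings")
        merged.merge(other)  # assumed to update `merged` in place
    return merged


class ToySketch:
    """Hypothetical stand-in exposing add, mergeable and merge; it only counts records."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0

    def add(self, x, y):
        self.n += len(x)

    def mergeable(self, other):
        return self.eps == other.eps

    def merge(self, other):
        self.n += other.n


# One sketch per partition, then a single merged sketch.
partitions = [([1.0, 2.0], [0, 1]), ([3.0], [1]), ([4.0, 5.0, 6.0], [0, 0, 1])]
per_partition = []
for x, y in partitions:
    sketch = ToySketch(eps=1e-4)
    sketch.add(x, y)
    per_partition.append(sketch)

combined = merge_all(per_partition)
print(combined.n)
```

With real OptimalBinningSketch instances, solve would be called on the combined sketch after merging.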

plot_progress()

Plot divergence measure progress.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

solve()

Solve optimal binning using added data.

Returns

self – Current fitted optimal binning.

Return type

OptimalBinningSketch

property splits

List of optimal split points when dtype is set to “numerical” or list of optimal bins when dtype is set to “categorical”.

Returns

splits

Return type

numpy.ndarray

property status

The status of the underlying optimization solver.

Returns

status

Return type

str

transform(x, metric='woe', metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Transform given data to Weight of Evidence (WoE) or event rate using bins from the current fitted optimal binning.

Parameters
  • x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.

  • metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence, “event_rate” to choose the event rate, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.

  • metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.

  • metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.

  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

x_new – Transformed array.

Return type

numpy array, shape = (n_samples,)

Notes

Transforming data that includes categories not present during training returns zero WoE or event rate for those categories.
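The mapping performed by transform with metric="woe" can be sketched for a numerical variable as follows. This is an illustration under assumptions, not the library's code: bins are taken to be left-closed intervals [a, b), WoE uses the non-event over event convention (the sign flips under the opposite convention), and the function names, split points and counts are made up.

```python
import math
from bisect import bisect_right


def woe_per_bin(n_nonevent, n_event):
    """Weight of Evidence of each bin: log of non-event share over event share."""
    total_nonevent = sum(n_nonevent)
    total_event = sum(n_event)
    return [
        math.log((nonevent / total_nonevent) / (event / total_event))
        for nonevent, event in zip(n_nonevent, n_event)
    ]


def transform_woe(x, splits, woe, metric_missing=0.0):
    """Replace each value by the WoE of the bin it falls in.

    splits must be sorted; bin k covers [splits[k-1], splits[k]).
    None and NaN are mapped to metric_missing.
    """
    out = []
    for value in x:
        if value is None or value != value:  # NaN is the only value != itself
            out.append(metric_missing)
        else:
            out.append(woe[bisect_right(splits, value)])
    return out


# Hypothetical fitted binning: two split points, hence three bins.
splits = [2.0, 5.0]
woe = woe_per_bin([400, 300, 100], [20, 60, 120])
x_new = transform_woe([1.0, 2.0, 7.5, float("nan")], splits, woe)
print(x_new)
```

A value equal to a split point falls in the bin to its right, which matches the left-closed interval assumption.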

GK: Greenwald-Khanna’s algorithm

class optbinning.binning.distributed.GK(eps=0.01)

Bases: object

Greenwald-Khanna’s streaming quantiles.

Parameters

eps (float (default=0.01)) – Relative error epsilon.

add(value)

Add value to sketch.

copy(gk)

Copy GK sketch.

merge(gk)

Merge sketch with another sketch gk.

merge_compress(entries=[])

Compress sketch.

mergeable(gk)

Check whether a sketch gk is mergeable.

property n

Number of records in sketch.

quantile(q)

Calculate quantile q.
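For intuition about what such a summary stores, here is a simplified Greenwald-Khanna sketch. It is not the library's GK class: SimpleGK is a hypothetical name, the compression schedule is a common simplification, and merging of two summaries is omitted. Each entry is [value, g, delta], where the running sum of g gives the minimum possible rank of the value and delta bounds the rank uncertainty.

```python
import math
import random
from bisect import bisect_right


class SimpleGK:
    """Simplified Greenwald-Khanna quantile summary with rank error about eps * n."""

    def __init__(self, eps=0.01):
        self.eps = eps
        self.n = 0
        self.entries = []  # sorted by value; each entry is [value, g, delta]
        self._period = max(1, round(1.0 / (2.0 * eps)))

    def add(self, value):
        # The probe [value, inf] sorts after every entry whose value is <= value.
        idx = bisect_right(self.entries, [value, math.inf])
        if idx == 0 or idx == len(self.entries):
            delta = 0  # a new minimum or maximum has an exactly known rank
        else:
            successor = self.entries[idx]
            delta = successor[1] + successor[2] - 1
        self.entries.insert(idx, [value, 1, delta])
        self.n += 1
        if self.n % self._period == 0:
            self._compress()

    def _compress(self):
        threshold = math.floor(2.0 * self.eps * self.n)
        i = len(self.entries) - 2
        while i >= 1:  # index 0 (the minimum) is never removed
            current, following = self.entries[i], self.entries[i + 1]
            if current[1] + following[1] + following[2] <= threshold:
                following[1] += current[1]  # fold the entry into its successor
                del self.entries[i]
            i -= 1

    def quantile(self, q):
        if not self.entries:
            raise ValueError("empty sketch")
        rank = max(1, math.ceil(q * self.n))
        bound = rank + self.eps * self.n
        rmin = 0
        previous = self.entries[0][0]
        for value, g, delta in self.entries:
            rmin += g
            if rmin + delta > bound:  # first entry whose maximum rank is too large
                return previous
            previous = value
        return self.entries[-1][0]


# Stream a shuffled range so that the true rank of a value equals the value itself.
data = list(range(1, 10001))
random.Random(0).shuffle(data)
gk = SimpleGK(eps=0.01)
for item in data:
    gk.add(item)
print(len(gk.entries), gk.quantile(0.5))
```

The summary holds far fewer entries than the stream, and any returned quantile is off by roughly eps * n positions in rank at most.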

Binning sketch: numerical variable - binary target

class optbinning.binning.distributed.BSketch(sketch='gk', eps=0.01, K=25, special_codes=None)

Bases: object

BSketch: binning sketch for numerical values and binary target.

Parameters
  • sketch (str, optional (default="gk")) – Sketch algorithm. Supported algorithms are “gk” (Greenwald-Khanna) and “t-digest” (Ted Dunning’s t-digest). The “t-digest” algorithm relies on the tdigest package.

  • eps (float (default=0.01)) – Relative error epsilon.

  • K (int (default=25)) – Excess growth parameter K used to compute the compression threshold in t-digest.

  • special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.

add(x, y, check_input=False)

Add arrays to the sketch.

Parameters
  • x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.

  • y (array-like, shape = (n_samples,)) – Target vector relative to x.

  • check_input (bool (default=False)) – Whether to check input arrays.

bins(splits)

Event and non-event counts for each bin given a list of split points.

Parameters

splits (array-like, shape = (n_splits,)) – List of split points.

Returns

bins

Return type

tuple of arrays of size n_splits + 1.
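As a point of reference, the exact counterpart of bins(splits) on raw data can be written in a few lines. This is only an illustration: the actual sketch estimates these counts from its quantile summaries rather than from stored records, and the function name, the left-closed bin convention and the (non-event, event) return order are assumptions.

```python
from bisect import bisect_right


def exact_bin_counts(x, y, splits):
    """Count non-events and events per bin for sorted split points.

    Returns two lists of length len(splits) + 1: (n_nonevent, n_event).
    Bin k covers [splits[k-1], splits[k]).
    """
    n_bins = len(splits) + 1
    n_nonevent = [0] * n_bins
    n_event = [0] * n_bins
    for value, target in zip(x, y):
        idx = bisect_right(splits, value)
        if target == 1:
            n_event[idx] += 1
        else:
            n_nonevent[idx] += 1
    return n_nonevent, n_event


counts_nonevent, counts_event = exact_bin_counts(
    x=[1, 2, 3, 4, 5, 6],
    y=[0, 0, 1, 0, 1, 1],
    splits=[2.5, 4.5],
)
print(counts_nonevent, counts_event)
```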

merge(bsketch)

Merge current instance with another BSketch instance.

Parameters

bsketch (object) – BSketch instance.

merge_sketches()

Merge event and non-event data internal sketches.

property n

Records count.

Returns

n

Return type

int

property n_event

Event count.

Returns

n_event

Return type

int

property n_nonevent

Non-event count.

Returns

n_nonevent

Return type

int

Binning sketch: categorical variable - binary target

class optbinning.binning.distributed.BCatSketch(cat_cutoff=None, special_codes=None)

Bases: object

BCatSketch: binning sketch for categorical/nominal data and binary target.

Parameters
  • cat_cutoff (float or None, optional (default=None)) – Generate an “others” bin containing the categories whose fraction of occurrences is below the cat_cutoff value.

  • special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.

add(x, y, check_input=False)

Add arrays to the sketch.

Parameters
  • x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.

  • y (array-like, shape = (n_samples,)) – Target vector relative to x.

  • check_input (bool (default=False)) – Whether to check input arrays.

bins()

Event and non-event counts for each bin given the current categories.

Returns

bins

Return type

tuple of arrays.

merge(bcatsketch)

Merge current instance with another BCatSketch instance.

Parameters

bcatsketch (object) – BCatSketch instance.
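For categorical data, a sketch of this kind can be thought of as per-category event and non-event counters, which makes merging a matter of adding counts. The class below illustrates that idea together with a cat_cutoff rule; CategoryCounts is a hypothetical name, the returned dictionary layout is an assumption, and this is not the BCatSketch implementation.

```python
from collections import Counter


class CategoryCounts:
    """Per-category event and non-event counters that can be merged."""

    def __init__(self):
        self.n_event = Counter()
        self.n_nonevent = Counter()

    def add(self, x, y):
        for category, target in zip(x, y):
            if target == 1:
                self.n_event[category] += 1
            else:
                self.n_nonevent[category] += 1

    def merge(self, other):
        # Counter.update adds counts key by key.
        self.n_event.update(other.n_event)
        self.n_nonevent.update(other.n_nonevent)

    @property
    def n(self):
        return sum(self.n_event.values()) + sum(self.n_nonevent.values())

    def bins(self, cat_cutoff=None):
        """Return {category: (n_nonevent, n_event)}, pooling rare categories.

        Categories whose share of records is below cat_cutoff go to "others"
        (this would clash with a real category of that name).
        """
        total = self.n
        result = {}
        others = [0, 0]
        for category in sorted(set(self.n_event) | set(self.n_nonevent)):
            nonevent = self.n_nonevent[category]
            event = self.n_event[category]
            if cat_cutoff is not None and (nonevent + event) / total < cat_cutoff:
                others[0] += nonevent
                others[1] += event
            else:
                result[category] = (nonevent, event)
        if others != [0, 0]:
            result["others"] = tuple(others)
        return result


first = CategoryCounts()
first.add(["A", "A", "B", "C"], [1, 0, 0, 1])
second = CategoryCounts()
second.add(["A", "B", "B", "D"], [0, 1, 0, 0])
first.merge(second)
print(first.n, first.bins(cat_cutoff=0.2))
```

Because counts are additive, merging in any order gives the same result, which is what makes this kind of sketch suitable for partitioned data.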

property n

Records count.

Returns

n

Return type

int

property n_event

Event count.

Returns

n_event

Return type

int

property n_nonevent

Non-event count.

Returns

n_nonevent

Return type

int