Optimal binning sketch with binary target¶

Introduction¶

Optimal binning is the constrained discretization of a numerical feature into bins given a binary target, maximizing a statistic such as Jeffrey’s divergence or Gini. Binning is a data preprocessing technique commonly used in binary classification, but none of the existing binning algorithms that support constraints can handle streaming data. The class OptimalBinningSketch implements a scalable, memory-efficient and robust algorithm for performing optimal binning in streaming settings. Algorithmic details are discussed in http://gnpalencia.org/blog/2020/binning_data_streams/.

Algorithms¶

OptimalBinningSketch¶

class optbinning.binning.distributed.OptimalBinningSketch(name='', dtype='numerical', sketch='gk', eps=0.0001, K=25, solver='cp', divergence='iv', max_n_prebins=20, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, min_bin_n_nonevent=None, max_bin_n_nonevent=None, min_bin_n_event=None, max_bin_n_event=None, monotonic_trend='auto', min_event_rate_diff=0, max_pvalue=None, max_pvalue_policy='consecutive', gamma=0, cat_cutoff=None, cat_unknown=None, cat_heuristic=False, special_codes=None, split_digits=None, mip_solver='bop', time_limit=100, verbose=False)¶

Bases: optbinning.binning.distributed.base.BaseSketch, sklearn.base.BaseEstimator

Optimal binning over data streams of a numerical or categorical variable with respect to a binary target.

Parameters

name (str, optional (default="")) – The variable name.
dtype (str, optional (default="numerical")) – The variable data type. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.
sketch (str, optional (default="gk")) – Sketch algorithm. Supported algorithms are “gk” (Greenwald-Khanna’s) and “t-digest” (Ted Dunning) algorithm. Algorithm “t-digest” relies on tdigest.
eps (float, optional (default=1e-4)) – Relative error epsilon. For sketch="gk" this is the absolute precision, whereas for sketch="t-digest" it is the relative precision.
K (int, optional (default=25)) – Excess growth parameter K used to compute the compression threshold in t-digest.
solver (str, optional (default="cp")) – The optimizer to solve the optimal binning problem. Supported solvers are “mip” to choose a mixed-integer programming solver, “cp” to choose a constraint programming solver or “ls” to choose LocalSolver.
divergence (str, optional (default="iv")) – The divergence measure in the objective function to be maximized. Supported divergences are “iv” (Information Value or Jeffrey’s divergence), “js” (Jensen-Shannon), “hellinger” (Hellinger divergence) and “triangular” (triangular discrimination).
max_n_prebins (int (default=20)) – The maximum number of bins after pre-binning (prebins).
min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].
max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].
min_bin_size (float or None, optional (default=None)) – The minimum fraction of records for each bin. If None, min_bin_size = min_prebin_size.
max_bin_size (float or None, optional (default=None)) – The maximum fraction of records for each bin. If None, max_bin_size = 1.0.
min_bin_n_nonevent (int or None, optional (default=None)) – The minimum number of non-event records for each bin. If None, min_bin_n_nonevent = 1.
max_bin_n_nonevent (int or None, optional (default=None)) – The maximum number of non-event records for each bin. If None, then an unlimited number of non-event records for each bin.
min_bin_n_event (int or None, optional (default=None)) – The minimum number of event records for each bin. If None, min_bin_n_event = 1.
max_bin_n_event (int or None, optional (default=None)) – The maximum number of event records for each bin. If None, then an unlimited number of event records for each bin.
monotonic_trend (str or None, optional (default="auto")) – The event rate monotonic trend. Supported trends are “auto”, “auto_heuristic” and “auto_asc_desc” to automatically determine the trend maximizing IV using a machine learning classifier, “ascending”, “descending”, “concave”, “convex”, “peak” and “peak_heuristic” to allow a peak change point, and “valley” and “valley_heuristic” to allow a valley change point. Trends “auto_heuristic”, “peak_heuristic” and “valley_heuristic” use a heuristic to determine the change point, and are significantly faster for large size instances (max_n_prebins > 20). Trend “auto_asc_desc” is used to automatically select the best monotonic trend between “ascending” and “descending”. If None, then the monotonic constraint is disabled.
min_event_rate_diff (float, optional (default=0)) – The minimum event rate difference between consecutive bins.
max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins. The Z-test is used to detect bins not satisfying the p-value constraint.
max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.
gamma (float, optional (default=0)) – Regularization strength to reduce the number of dominating bins. Larger values specify stronger regularization.
cat_cutoff (float or None, optional (default=None)) – Generate an “others” bin containing the categories whose fraction of occurrences is below the cat_cutoff value. This option is available when dtype is “categorical”.
cat_heuristic (bool (default=False)) – Whether to merge categories to guarantee max_n_prebins. If True, this option will be triggered when the number of categories >= max_n_prebins. This option is recommended if the number of categories can increase considerably in the long run and recurrent calls to the method solve are required.
special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.
mip_solver (str, optional (default="bop")) – The mixed-integer programming solver. Supported solvers are “bop” to choose the Google OR-Tools binary optimizer or “cbc” to choose the COIN-OR Branch-and-Cut solver CBC.
time_limit (int (default=100)) – The maximum time in seconds to run the optimization solver.
verbose (bool (default=False)) – Enable verbose output.
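
As an illustration of the default divergence="iv", the following minimal sketch computes the Weight of Evidence (WoE) of each bin and the resulting Information Value from per-bin event and non-event counts. It is not the library's code: the function name woe_and_iv and the sample counts are hypothetical, and the WoE sign convention (non-event share over event share) is an assumption.

```python
import math


def woe_and_iv(n_nonevent, n_event):
    """Return the per-bin WoE and the total IV from per-bin counts.

    Hypothetical helper for illustration; every bin is assumed to hold
    at least one event and one non-event record so the logarithm is defined.
    """
    total_nonevent = sum(n_nonevent)
    total_event = sum(n_event)
    woe, iv = [], 0.0
    for ne, e in zip(n_nonevent, n_event):
        p = ne / total_nonevent  # share of all non-events falling in this bin
        q = e / total_event      # share of all events falling in this bin
        w = math.log(p / q)      # Weight of Evidence of the bin (assumed convention)
        woe.append(w)
        iv += (p - q) * w        # contribution of the bin to the Information Value
    return woe, iv


# Made-up counts for four bins.
n_nonevent = [400, 300, 200, 100]
n_event = [10, 20, 30, 40]
woe, iv = woe_and_iv(n_nonevent, n_event)
print([round(w, 3) for w in woe], round(iv, 3))
```

In this sketch a bin without events or without non-events would make the logarithm undefined, which is one plausible reason for requiring at least one record of each class per bin.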

Notes

The parameter sketch is ignored when dtype="categorical". The sketch parameter K is only applicable when sketch="t-digest".

Both quantile sketch algorithms produce good results, with t-digest being the more accurate. Note, however, that the t-digest implementation is significantly slower than the GK implementation; thus, GK is the recommended algorithm when handling partitions. In addition, GK is deterministic and therefore returns reproducible results.

add(x, y, check_input=False)¶

Add new data x, y to the binning sketch.

Parameters

x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Target vector relative to x.
check_input (bool (default=False)) – Whether to check input arrays.

property binning_table¶

Return an instantiated binning table. Please refer to Binning table: binary target.

Returns: binning_table
Return type: BinningTable.

get_params(deep=True)¶

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

information(print_level=1)¶

Print overview information about the option settings, problem statistics, and the solution of the computation.

Parameters: print_level (int (default=1)) – Level of details.

merge(optbsketch)¶

Merge current instance with another OptimalBinningSketch instance.

Parameters: optbsketch (object) – OptimalBinningSketch instance.

mergeable(optbsketch)¶

Check whether two OptimalBinningSketch instances can be merged.

Parameters: optbsketch (object) – OptimalBinningSketch instance.
Returns: mergeable
Return type: bool

plot_progress()¶: Plot divergence measure progress.

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

solve()¶

Solve optimal binning using added data.

Returns: self – Current fitted optimal binning.
Return type: OptimalBinningSketch

property splits¶

List of optimal split points when dtype is set to “numerical” or list of optimal bins when dtype is set to “categorical”.

Returns: splits
Return type: numpy.ndarray

property status¶

The status of the underlying optimization solver.

Returns: status
Return type: str

transform(x, metric='woe', metric_special=0, metric_missing=0, show_digits=2, check_input=False)¶

Transform given data to Weight of Evidence (WoE) or event rate using bins from the current fitted optimal binning.

Parameters

x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence, “event_rate” to choose the event rate, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.

Returns

x_new – Transformed array.

Return type

numpy array, shape = (n_samples,)

Notes

Transforming data that includes categories not present during training returns zero WoE or event rate for those categories.
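
The mapping performed by transform with metric="woe" can be pictured with the following minimal sketch for a numerical variable. It is not the library's implementation: transform_woe, the split points and the WoE values are hypothetical, and the convention that a value equal to a split point falls into the upper bin is an assumption.

```python
import bisect


def transform_woe(x, splits, woe_per_bin, metric_missing=0.0):
    """Map each value to the WoE of the bin it falls into.

    Hypothetical helper: bin i is assumed to cover [splits[i-1], splits[i]).
    Missing values (None or NaN) receive metric_missing.
    """
    out = []
    for value in x:
        if value is None or value != value:  # NaN is the only value not equal to itself
            out.append(metric_missing)
        else:
            idx = bisect.bisect_right(splits, value)  # index of the containing bin
            out.append(woe_per_bin[idx])
    return out


splits = [25.0, 40.0, 60.0]            # made-up optimal split points
woe_per_bin = [1.2, 0.4, -0.3, -1.1]   # made-up WoE, one value per bin
x = [18, 25, 59.9, 75, float("nan")]
result = transform_woe(x, splits, woe_per_bin)
print(result)
```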

GK: Greenwald-Khanna’s algorithm¶

class optbinning.binning.distributed.GK(eps=0.01)¶

Bases: object

Greenwald-Khanna’s streaming quantiles.

Parameters: eps (float (default=0.01)) – Relative error epsilon.

add(value)¶: Add value to sketch.

copy(gk)¶: Copy GK sketch.

merge(gk)¶: Merge sketch with another sketch gk.

merge_compress(entries=[])¶: Compress sketch.

mergeable(gk)¶: Check whether a sketch gk is mergeable.

property n¶: Number of records in sketch.

quantile(q)¶: Calculate quantile q.
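
To give an idea of how an add/quantile interface of this kind can work, here is a simplified Greenwald-Khanna summary in pure Python. It is a sketch under stated assumptions, not the GK class documented above: the class SimpleGK, its internal layout and its compression schedule are hypothetical, and the band-based compression of the original algorithm is omitted. Each entry stores a value, the gap g to the previous entry's minimum rank, and the rank uncertainty delta.

```python
import bisect
import math


class SimpleGK:
    """Simplified Greenwald-Khanna quantile summary (illustrative only)."""

    def __init__(self, eps=0.01):
        self.eps = eps
        self.n = 0
        self.entries = []  # sorted list of [value, g, delta]
        self._period = max(1, int(1 / (2 * eps)))  # compress every few inserts

    def add(self, value):
        # Position after any entries holding the same value.
        i = bisect.bisect_right(self.entries, [value, math.inf, math.inf])
        if i == 0 or i == len(self.entries):
            delta = 0  # a new minimum or maximum has an exactly known rank
        else:
            succ = self.entries[i]
            delta = succ[1] + succ[2] - 1  # inherit the successor's uncertainty
        self.entries.insert(i, [value, 1, delta])
        self.n += 1
        if self.n % self._period == 0:
            self._compress()

    def _compress(self):
        threshold = math.floor(2 * self.eps * self.n)
        i = len(self.entries) - 2
        while i >= 1:  # the first entry (minimum) is never removed
            g = self.entries[i][1]
            nxt = self.entries[i + 1]
            if g + nxt[1] + nxt[2] <= threshold:
                nxt[1] += g  # fold entry i into its successor
                del self.entries[i]
            i -= 1

    def quantile(self, q):
        rank = min(self.n, max(1, math.ceil(q * self.n)))
        best_value, best_err = None, math.inf
        rmin = 0
        for value, g, delta in self.entries:
            rmin += g  # minimum possible rank of this entry
            err = max(rank - rmin, rmin + delta - rank)
            if err < best_err:
                best_value, best_err = value, err
        return best_value


n = 10007
gk = SimpleGK(eps=0.01)
for i in range(n):
    gk.add((i * 7919) % n)  # deterministic permutation of 0..n-1
median = gk.quantile(0.5)
print(median, len(gk.entries))
```

The summary keeps far fewer entries than records seen, and the rank of the returned value should differ from the requested rank by at most about eps * n.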

Binning sketch: numerical variable - binary target¶

class optbinning.binning.distributed.BSketch(sketch='gk', eps=0.01, K=25, special_codes=None)¶

Bases: object

BSketch: binning sketch for numerical values and binary target.

Parameters

sketch (str, optional (default="gk")) – Sketch algorithm. Supported algorithms are “gk” (Greenwald-Khanna’s) and “t-digest” (Ted Dunning) algorithm. Algorithm “t-digest” relies on tdigest.
eps (float (default=0.01)) – Relative error epsilon.
K (int (default=25)) – Excess growth parameter K used to compute the compression threshold in t-digest.
special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.

add(x, y, check_input=False)¶

Add arrays to the sketch.

Parameters

x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Target vector relative to x.
check_input (bool (default=False)) – Whether to check input arrays.

bins(splits)¶

Event and non-event counts for each bin given a list of split points.

Parameters: splits (array-like, shape = (n_splits,)) – List of split points.
Returns: bins
Return type: tuple of arrays of size n_splits + 1.
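
The shape of the result can be illustrated with exact counts computed from raw data. This is only a sketch: a binning sketch would estimate these counts from its internal quantile summaries rather than from stored records, and the function bin_counts, the order of the returned tuple and the half-open bin convention are assumptions.

```python
import bisect


def bin_counts(x, y, splits):
    """Exact non-event and event counts per bin for sorted split points.

    Hypothetical helper: n_splits split points define n_splits + 1 bins,
    and a value equal to a split point is assigned to the upper bin.
    """
    n_bins = len(splits) + 1
    n_nonevent = [0] * n_bins
    n_event = [0] * n_bins
    for value, target in zip(x, y):
        idx = bisect.bisect_right(splits, value)  # index of the containing bin
        if target == 1:
            n_event[idx] += 1
        else:
            n_nonevent[idx] += 1
    return n_nonevent, n_event


x = [1.0, 2.5, 3.0, 4.2, 5.1, 6.8]
y = [0, 0, 1, 0, 1, 1]
n_nonevent, n_event = bin_counts(x, y, splits=[3.0, 5.0])
print(n_nonevent, n_event)
```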

merge(bsketch)¶

Merge current instance with another BSketch instance.

Parameters: bsketch (object) – BSketch instance.

merge_sketches()¶: Merge event and non-event data internal sketches.

property n¶

Records count.

Returns: n
Return type: int

property n_event¶

Event count.

Returns: n_event
Return type: int

property n_nonevent¶

Non-event count.

Returns: n_nonevent
Return type: int

Binning sketch: categorical variable - binary target¶

class optbinning.binning.distributed.BCatSketch(cat_cutoff=None, special_codes=None)¶

Bases: object

BCatSketch: binning sketch for categorical/nominal data and binary target.

Parameters

cat_cutoff (float or None, optional (default=None)) – Generate an “others” bin containing the categories whose fraction of occurrences is below the cat_cutoff value.
special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.

add(x, y, check_input=False)¶

Add arrays to the sketch.

Parameters

x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Target vector relative to x.
check_input (bool (default=False)) – Whether to check input arrays.

bins()¶

Event and non-event counts for each bin given the current categories.

Returns: bins
Return type: tuple of arrays.

merge(bcatsketch)¶

Merge current instance with another BCatSketch instance.

Parameters: bcatsketch (object) – BCatSketch instance.
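
For categorical data a mergeable sketch can be as simple as a pair of per-category counters, which makes merging partitions exact. The following is a minimal sketch of that idea, not the BCatSketch implementation: the class CatCounts, its attributes and the dictionary returned by bins are hypothetical.

```python
from collections import Counter


class CatCounts:
    """Per-category event and non-event counters (illustrative only)."""

    def __init__(self):
        self.nonevent = Counter()
        self.event = Counter()

    def add(self, x, y):
        for cat, target in zip(x, y):
            if target == 1:
                self.event[cat] += 1
            else:
                self.nonevent[cat] += 1

    def merge(self, other):
        # Counter.update adds counts, so merging partitions is exact.
        self.nonevent.update(other.nonevent)
        self.event.update(other.event)

    @property
    def n(self):
        return sum(self.nonevent.values()) + sum(self.event.values())

    def bins(self, cat_cutoff=None):
        """Return {category: (n_nonevent, n_event)}, grouping rare categories."""
        out = {}
        for cat in sorted(set(self.nonevent) | set(self.event)):
            ne, e = self.nonevent[cat], self.event[cat]
            rare = cat_cutoff is not None and (ne + e) / self.n < cat_cutoff
            key = "others" if rare else cat
            prev = out.get(key, (0, 0))
            out[key] = (prev[0] + ne, prev[1] + e)
        return out


# Two partitions of a stream, sketched separately and then merged.
a, b = CatCounts(), CatCounts()
a.add(["A", "A", "B", "C"], [0, 1, 0, 1])
b.add(["A", "B", "B", "D"], [1, 0, 0, 0])
a.merge(b)
print(a.n, a.bins(), a.bins(cat_cutoff=0.2))
```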

property n¶

Records count.

Returns: n
Return type: int

property n_event¶

Event count.

Returns: n_event
Return type: int

property n_nonevent¶

Non-event count.

Returns: n_nonevent
Return type: int