Optimal binning sketch with binary target
Introduction
Optimal binning is the constrained discretization of a numerical feature into bins given a binary target, maximizing a statistic such as Jeffrey’s divergence or Gini. Binning is a data preprocessing technique commonly used in binary classification, but none of the existing binning algorithms that support constraints can handle streaming data. The class OptimalBinningSketch implements a scalable, memory-efficient and robust algorithm for performing optimal binning in the streaming setting. Algorithmic details are discussed in http://gnpalencia.org/blog/2020/binning_data_streams/.
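As an illustration of the statistic being maximized, the following minimal sketch computes Jeffrey’s divergence (Information Value) from per-bin non-event and event counts. The function name and the counts are hypothetical, and the sketch assumes every bin holds at least one event and one non-event; it is not the library’s implementation.

```python
import math


def information_value(n_nonevent, n_event):
    """Jeffrey's divergence (IV) between non-event and event bin distributions."""
    total_nonevent = sum(n_nonevent)
    total_event = sum(n_event)
    iv = 0.0
    for ne, e in zip(n_nonevent, n_event):
        p = ne / total_nonevent  # share of all non-events falling in this bin
        q = e / total_event      # share of all events falling in this bin
        iv += (p - q) * math.log(p / q)
    return iv


# Two hypothetical binnings with two bins each.
weak = information_value([50, 50], [50, 50])    # bins do not separate the classes
strong = information_value([80, 20], [20, 80])  # bins separate the classes well
print(weak, strong)
```

A binning whose bins carry no information about the target has zero divergence, so an optimizer maximizing this quantity prefers split points that separate events from non-events.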
Algorithms
OptimalBinningSketch
- class optbinning.binning.distributed.OptimalBinningSketch(name='', dtype='numerical', sketch='gk', eps=0.0001, K=25, solver='cp', divergence='iv', max_n_prebins=20, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, min_bin_n_nonevent=None, max_bin_n_nonevent=None, min_bin_n_event=None, max_bin_n_event=None, monotonic_trend='auto', min_event_rate_diff=0, max_pvalue=None, max_pvalue_policy='consecutive', gamma=0, cat_cutoff=None, cat_unknown=None, cat_heuristic=False, special_codes=None, split_digits=None, mip_solver='bop', time_limit=100, verbose=False)
Bases: optbinning.binning.distributed.base.BaseSketch, sklearn.base.BaseEstimator
Optimal binning over data streams of a numerical or categorical variable with respect to a binary target.
- Parameters
name (str, optional (default="")) – The variable name.
dtype (str, optional (default="numerical")) – The variable data type. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.
sketch (str, optional (default="gk")) – Sketch algorithm. Supported algorithms are “gk” (Greenwald-Khanna’s) and “t-digest” (Ted Dunning) algorithm. Algorithm “t-digest” relies on tdigest.
eps (float, optional (default=1e-4)) – Relative error epsilon. For sketch="gk" this is the absolute precision, whereas for sketch="t-digest" it is the relative precision.
K (int, optional (default=25)) – Parameter excess growth K to compute compress threshold in t-digest.
solver (str, optional (default="cp")) – The optimizer to solve the optimal binning problem. Supported solvers are “mip” to choose a mixed-integer programming solver, “cp” to choose a constraint programming solver or “ls” to choose LocalSolver.
divergence (str, optional (default="iv")) – The divergence measure in the objective function to be maximized. Supported divergences are “iv” (Information Value or Jeffrey’s divergence), “js” (Jensen-Shannon), “hellinger” (Hellinger divergence) and “triangular” (triangular discrimination).
max_n_prebins (int (default=20)) – The maximum number of bins after pre-binning (prebins).
min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].
max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].
min_bin_size (float or None, optional (default=None)) – The minimum fraction of records for each bin. If None, min_bin_size = min_prebin_size.
max_bin_size (float or None, optional (default=None)) – The maximum fraction of records for each bin. If None, max_bin_size = 1.0.
min_bin_n_nonevent (int or None, optional (default=None)) – The minimum number of non-event records for each bin. If None, min_bin_n_nonevent = 1.
max_bin_n_nonevent (int or None, optional (default=None)) – The maximum number of non-event records for each bin. If None, the number of non-event records per bin is unlimited.
min_bin_n_event (int or None, optional (default=None)) – The minimum number of event records for each bin. If None, min_bin_n_event = 1.
max_bin_n_event (int or None, optional (default=None)) – The maximum number of event records for each bin. If None, the number of event records per bin is unlimited.
monotonic_trend (str or None, optional (default="auto")) – The event rate monotonic trend. Supported trends are “auto”, “auto_heuristic” and “auto_asc_desc” to automatically determine the trend maximizing IV using a machine learning classifier, “ascending”, “descending”, “concave”, “convex”, “peak” and “peak_heuristic” to allow a peak change point, and “valley” and “valley_heuristic” to allow a valley change point. Trends “auto_heuristic”, “peak_heuristic” and “valley_heuristic” use a heuristic to determine the change point, and are significantly faster for large instances (max_n_prebins > 20). Trend “auto_asc_desc” is used to automatically select the best monotonic trend between “ascending” and “descending”. If None, then the monotonic constraint is disabled.
min_event_rate_diff (float, optional (default=0)) – The minimum event rate difference between consecutive bins.
max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins. The Z-test is used to detect bins not satisfying the p-value constraint.
max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.
gamma (float, optional (default=0)) – Regularization strength to reduce the number of dominating bins. Larger values specify stronger regularization.
cat_cutoff (float or None, optional (default=None)) – Generate bin others with categories in which the fraction of occurrences is below the cat_cutoff value. This option is available when dtype is “categorical”.
cat_heuristic (bool (default=False)) – Whether to merge categories to guarantee max_n_prebins. If True, this option will be triggered when the number of categories >= max_n_prebins. This option is recommended if the number of categories, in the long run, can increase considerably, and recurrent calls to method solve are required.
special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.
mip_solver (str, optional (default="bop")) – The mixed-integer programming solver. Supported solvers are “bop” to choose the Google OR-Tools binary optimizer or “cbc” to choose the COIN-OR Branch-and-Cut solver CBC.
time_limit (int (default=100)) – The maximum time in seconds to run the optimization solver.
verbose (bool (default=False)) – Enable verbose output.
Notes
The parameter sketch is ignored when dtype="categorical". The sketch parameter K is only applicable when sketch="t-digest".
Both quantile sketch algorithms produce good results, with t-digest being the more accurate. Note, however, that the t-digest implementation is significantly slower than the GK implementation; thus, GK is the recommended algorithm when handling partitions. In addition, GK is deterministic and therefore returns reproducible results.
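To make the max_pvalue constraint from the parameter list more concrete, the sketch below runs a pooled two-proportion Z-test on the event rates of two bins. The function name and counts are hypothetical, and the pooled two-sided form is an assumption; the exact test statistic used inside the library may differ.

```python
import math


def two_proportion_pvalue(n_event_1, n_records_1, n_event_2, n_records_2):
    """Two-sided p-value of a pooled two-proportion Z-test (illustrative only)."""
    rate_1 = n_event_1 / n_records_1
    rate_2 = n_event_2 / n_records_2
    pooled = (n_event_1 + n_event_2) / (n_records_1 + n_records_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_records_1 + 1 / n_records_2))
    if std_err == 0:
        # Both bins contain only events or only non-events: no detectable difference.
        return 1.0
    z = abs(rate_1 - rate_2) / std_err
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


same_rate = two_proportion_pvalue(10, 100, 10, 100)       # identical event rates
different_rate = two_proportion_pvalue(10, 100, 50, 100)  # clearly different rates
print(same_rate, different_rate)
```

Under this reading, a pair of bins whose p-value exceeds max_pvalue has event rates that are not significantly different, so the pair would violate the constraint.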
- add(x, y, check_input=False)
Add new data x, y to the binning sketch.
- Parameters
x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Target vector relative to x.
check_input (bool (default=False)) – Whether to check input arrays.
- property binning_table
Return an instantiated binning table. Please refer to Binning table: binary target.
- Returns
binning_table
- Return type
BinningTable.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- information(print_level=1)
Print overview information about the options settings, problem statistics, and the solution of the computation.
- Parameters
print_level (int (default=1)) – Level of details.
- merge(optbsketch)
Merge current instance with another OptimalBinningSketch instance.
- Parameters
optbsketch (object) – OptimalBinningSketch instance.
- mergeable(optbsketch)
Check whether two OptimalBinningSketch instances can be merged.
- Parameters
optbsketch (object) – OptimalBinningSketch instance.
- Returns
mergeable
- Return type
bool
- plot_progress()
Plot divergence measure progress.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- solve()
Solve optimal binning using added data.
- Returns
self – Current fitted optimal binning.
- Return type
OptimalBinningSketch
- property splits
List of optimal split points when dtype is set to “numerical” or list of optimal bins when dtype is set to “categorical”.
- Returns
splits
- Return type
numpy.ndarray
- property status
The status of the underlying optimization solver.
- Returns
status
- Return type
str
- transform(x, metric='woe', metric_special=0, metric_missing=0, show_digits=2, check_input=False)
Transform given data to Weight of Evidence (WoE) or event rate using bins from the current fitted optimal binning.
- Parameters
x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence, “event_rate” to choose the event rate, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.
- Returns
x_new – Transformed array.
- Return type
numpy array, shape = (n_samples,)
Notes
Transformation of data including categories not present during training returns zero WoE or event rate.
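The WoE transformation described above can be sketched in a few lines for the numerical case. The helper names, split points and counts are hypothetical; the sketch assumes left-closed, right-open bins and the convention WoE = log(non-event share / event share), and it only mimics the role of metric_missing rather than reproducing the library’s code.

```python
import bisect
import math


def woe_per_bin(n_nonevent, n_event):
    """WoE of each bin as log(share of non-events / share of events)."""
    total_nonevent, total_event = sum(n_nonevent), sum(n_event)
    return [
        math.log((ne / total_nonevent) / (e / total_event))
        for ne, e in zip(n_nonevent, n_event)
    ]


def transform_woe(x, splits, woe, metric_missing=0):
    """Map each value to the WoE of its bin; None or NaN gets metric_missing."""
    out = []
    for v in x:
        if v is None or v != v:  # NaN is the only value not equal to itself
            out.append(metric_missing)
        else:
            # bisect_right sends a value equal to a split point to the right bin.
            out.append(woe[bisect.bisect_right(splits, v)])
    return out


splits = [10, 20]                               # hypothetical split points
woe = woe_per_bin([80, 50, 20], [20, 50, 80])   # hypothetical counts for 3 bins
values = transform_woe([5, 10, 25, None], splits, woe)
print(values)
```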
GK: Greenwald-Khanna’s algorithm
- class optbinning.binning.distributed.GK(eps=0.01)
Bases: object
Greenwald-Khanna’s streaming quantiles.
- Parameters
eps (float (default=0.01)) – Relative error epsilon.
- add(value)
Add value to sketch.
- copy(gk)
Copy GK sketch.
- merge(gk)
Merge sketch with another sketch gk.
- merge_compress(entries=[])
Compress sketch.
- mergeable(gk)
Check whether a sketch gk is mergeable.
- property n
Number of records in sketch.
- quantile(q)
Calculate quantile q.
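For readers unfamiliar with Greenwald-Khanna summaries, the following is a simplified, self-contained sketch of the idea behind add and quantile. The class SimpleGK is hypothetical and omits the band logic and merging of the original algorithm, so it illustrates the rank-error guarantee rather than reproducing the GK class documented here.

```python
import bisect
import math
import random


class SimpleGK:
    """Simplified Greenwald-Khanna quantile summary (illustrative, no bands)."""

    def __init__(self, eps=0.01):
        self.eps = eps
        self.n = 0
        self.values = []  # stored sample values, kept sorted
        self.g = []       # g[i]: gap between minimum ranks of entries i-1 and i
        self.delta = []   # delta[i]: maximum rank minus minimum rank of entry i

    def add(self, value):
        i = bisect.bisect_right(self.values, value)
        if i == 0 or i == len(self.values):
            d = 0  # a new minimum or maximum has an exactly known rank
        else:
            d = self.g[i] + self.delta[i] - 1  # inherit uncertainty from successor
        self.values.insert(i, value)
        self.g.insert(i, 1)
        self.delta.insert(i, d)
        self.n += 1
        if self.n % max(1, round(1 / (2 * self.eps))) == 0:
            self._compress()

    def _compress(self):
        threshold = math.floor(2 * self.eps * self.n)
        i = len(self.values) - 2
        while i >= 1:  # the first entry (current minimum) is always kept
            if self.g[i] + self.g[i + 1] + self.delta[i + 1] <= threshold:
                self.g[i + 1] += self.g[i]  # successor absorbs entry i
                del self.values[i], self.g[i], self.delta[i]
            i -= 1

    def quantile(self, q):
        target = max(1, math.ceil(q * self.n))
        best, best_err, rmin = None, None, 0
        for v, g, d in zip(self.values, self.g, self.delta):
            rmin += g
            err = max(target - rmin, rmin + d - target)  # worst-case rank error
            if best_err is None or err < best_err:
                best, best_err = v, err
        return best


random.seed(0)
data = list(range(10000))
random.shuffle(data)
gk = SimpleGK(eps=0.01)
for v in data:
    gk.add(v)
median = gk.quantile(0.5)
print(len(gk.values), median)
```

With eps=0.01 and 10,000 distinct values, the rank of the returned median is within eps * n = 100 positions of the true median, while the summary stores fewer entries than the number of values seen.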
Binning sketch: numerical variable - binary target
- class optbinning.binning.distributed.BSketch(sketch='gk', eps=0.01, K=25, special_codes=None)
Bases: object
BSketch: binning sketch for numerical values and binary target.
- Parameters
sketch (str, optional (default="gk")) – Sketch algorithm. Supported algorithms are “gk” (Greenwald-Khanna’s) and “t-digest” (Ted Dunning) algorithm. Algorithm “t-digest” relies on tdigest.
eps (float (default=0.01)) – Relative error epsilon.
K (int (default=25)) – Parameter excess growth K to compute compress threshold in t-digest.
special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
- add(x, y, check_input=False)
Add arrays to the sketch.
- Parameters
x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Target vector relative to x.
check_input (bool (default=False)) – Whether to check input arrays.
- bins(splits)
Event and non-event counts for each bin given a list of split points.
- Parameters
splits (array-like, shape = (n_splits,)) – List of split points.
- Returns
bins
- Return type
tuple of arrays of size n_splits + 1.
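The shape of this result can be illustrated by counting events and non-events per bin exactly on a small in-memory sample. The function bin_counts and the data are hypothetical, the return order (non-events first) is an assumption, and the real sketch estimates these counts from quantile summaries instead of raw data.

```python
import bisect


def bin_counts(x, y, splits):
    """Exact non-event and event counts for the len(splits) + 1 bins."""
    n_bins = len(splits) + 1
    n_nonevent = [0] * n_bins
    n_event = [0] * n_bins
    for value, target in zip(x, y):
        b = bisect.bisect_right(splits, value)  # index of the bin holding value
        if target == 1:
            n_event[b] += 1
        else:
            n_nonevent[b] += 1
    return n_nonevent, n_event


x = [1, 5, 12, 30, 18]
y = [0, 1, 0, 1, 1]
n_nonevent, n_event = bin_counts(x, y, splits=[10, 20])
print(n_nonevent, n_event)
```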
- merge(bsketch)
Merge current instance with another BSketch instance.
- Parameters
bsketch (object) – BSketch instance.
- merge_sketches()
Merge event and non-event data internal sketches.
- property n
Records count.
- Returns
n
- Return type
int
- property n_event
Event count.
- Returns
n_event
- Return type
int
- property n_nonevent
Non-event count.
- Returns
n_nonevent
- Return type
int
Binning sketch: categorical variable - binary target
- class optbinning.binning.distributed.BCatSketch(cat_cutoff=None, special_codes=None)
Bases: object
BCatSketch: binning sketch for categorical/nominal data and binary target.
- Parameters
cat_cutoff (float or None, optional (default=None)) – Generate bin others with categories in which the fraction of occurrences is below the cat_cutoff value.
special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
- add(x, y, check_input=False)
Add arrays to the sketch.
- Parameters
x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Target vector relative to x.
check_input (bool (default=False)) – Whether to check input arrays.
- bins()
Event and non-event counts for each bin given the current categories.
- Returns
bins
- Return type
tuple of arrays.
- merge(bcatsketch)
Merge current instance with another BCatSketch instance.
- Parameters
bcatsketch (object) – BCatSketch instance.
- property n
Records count.
- Returns
n
- Return type
int
- property n_event
Event count.
- Returns
n_event
- Return type
int
- property n_nonevent
Non-event count.
- Returns
n_nonevent
- Return type
int
- Returns
n_nonevent
- Return type
int
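A categorical binning sketch of this kind can be approximated with plain counters, which also shows why merging two partitions is cheap. The class SimpleCatSketch below is a hypothetical stand-in, not the BCatSketch implementation; in particular, returning a dictionary from bins and naming the rare-category bin "others" are assumptions made for illustration.

```python
from collections import Counter


class SimpleCatSketch:
    """Per-category event and non-event counters (illustrative only)."""

    def __init__(self, cat_cutoff=None):
        self.cat_cutoff = cat_cutoff
        self.nonevent = Counter()
        self.event = Counter()

    def add(self, x, y):
        for category, target in zip(x, y):
            if target == 1:
                self.event[category] += 1
            else:
                self.nonevent[category] += 1

    def merge(self, other):
        # Counter.update adds counts, so merging partitions is a simple sum.
        self.nonevent.update(other.nonevent)
        self.event.update(other.event)

    @property
    def n(self):
        return sum(self.nonevent.values()) + sum(self.event.values())

    def bins(self):
        """Map each bin to (n_nonevent, n_event), grouping rare categories."""
        result = {}
        for category in sorted(set(self.nonevent) | set(self.event)):
            ne, e = self.nonevent[category], self.event[category]
            key = category
            if self.cat_cutoff is not None and (ne + e) / self.n < self.cat_cutoff:
                key = "others"  # category is too rare to keep its own bin
            prev = result.get(key, (0, 0))
            result[key] = (prev[0] + ne, prev[1] + e)
        return result


part_a = SimpleCatSketch(cat_cutoff=0.2)
part_a.add(["A", "A", "B", "C"], [0, 1, 0, 1])
part_b = SimpleCatSketch(cat_cutoff=0.2)
part_b.add(["A", "B", "D"], [1, 1, 0])
part_a.merge(part_b)
merged_bins = part_a.bins()
print(part_a.n, merged_bins)
```

Here categories C and D each account for 1 of 7 records, below the cutoff of 0.2, so they are pooled into a single bin after the merge.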