Optimal binning with binary target
-
class optbinning.OptimalBinning(name='', dtype='numerical', prebinning_method='cart', solver='cp', divergence='iv', max_n_prebins=20, min_prebin_size=0.05, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, min_bin_n_nonevent=None, max_bin_n_nonevent=None, min_bin_n_event=None, max_bin_n_event=None, monotonic_trend='auto', min_event_rate_diff=0, max_pvalue=None, max_pvalue_policy='consecutive', gamma=0, outlier_detector=None, outlier_params=None, class_weight=None, cat_cutoff=None, cat_unknown=None, user_splits=None, user_splits_fixed=None, special_codes=None, split_digits=None, mip_solver='bop', time_limit=100, verbose=False, **prebinning_kwargs)
Bases: optbinning.binning.base.BaseOptimalBinning
Optimal binning of a numerical or categorical variable with respect to a binary target.
- Parameters
name (str, optional (default="")) – The variable name.
dtype (str, optional (default="numerical")) – The variable data type. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.
prebinning_method (str, optional (default="cart")) – The pre-binning method. Supported methods are “cart” for a CART decision tree, “mdlp” for Minimum Description Length Principle (MDLP), “quantile” to generate prebins with approximately same frequency and “uniform” to generate prebins with equal width. Method “cart” uses sklearn.tree.DecisionTreeClassifier.
solver (str, optional (default="cp")) – The optimizer to solve the optimal binning problem. Supported solvers are “mip” to choose a mixed-integer programming solver, “cp” to choose a constraint programming solver or “ls” to choose LocalSolver.
divergence (str, optional (default="iv")) –
The divergence measure in the objective function to be maximized. Supported divergences are “iv” (Information Value or Jeffreys’ divergence), “js” (Jensen-Shannon), “hellinger” (Hellinger divergence) and “triangular” (triangular discrimination).
New in version 0.7.0.
max_n_prebins (int (default=20)) – The maximum number of bins after pre-binning (prebins).
min_prebin_size (float (default=0.05)) – The minimum fraction of records for each prebin.
min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].
max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].
min_bin_size (float or None, optional (default=None)) – The minimum fraction of records for each bin. If None, min_bin_size = min_prebin_size.
max_bin_size (float or None, optional (default=None)) – The maximum fraction of records for each bin. If None, max_bin_size = 1.0.
min_bin_n_nonevent (int or None, optional (default=None)) – The minimum number of non-event records for each bin. If None, min_bin_n_nonevent = 1.
max_bin_n_nonevent (int or None, optional (default=None)) – The maximum number of non-event records for each bin. If None, the number of non-event records per bin is unlimited.
min_bin_n_event (int or None, optional (default=None)) – The minimum number of event records for each bin. If None, min_bin_n_event = 1.
max_bin_n_event (int or None, optional (default=None)) – The maximum number of event records for each bin. If None, the number of event records per bin is unlimited.
monotonic_trend (str or None, optional (default="auto")) – The event rate monotonic trend. Supported trends are “auto”, “auto_heuristic” and “auto_asc_desc” to automatically determine the trend maximizing IV using a machine learning classifier, “ascending”, “descending”, “concave”, “convex”, “peak” and “peak_heuristic” to allow a peak change point, and “valley” and “valley_heuristic” to allow a valley change point. Trends “auto_heuristic”, “peak_heuristic” and “valley_heuristic” use a heuristic to determine the change point, and are significantly faster for large instances (max_n_prebins > 20). Trend “auto_asc_desc” automatically selects the best monotonic trend between “ascending” and “descending”. If None, the monotonic constraint is disabled.
min_event_rate_diff (float, optional (default=0)) – The minimum event rate difference between consecutive bins. For solver “ls”, this option currently only applies when monotonic_trend is “ascending”, “descending”, “peak_heuristic” or “valley_heuristic”.
max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins. The Z-test is used to detect bins not satisfying the p-value constraint. Option supported by solvers “cp” and “mip”.
max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.
gamma (float, optional (default=0)) –
Regularization strength to reduce the number of dominating bins. Larger values specify stronger regularization. Option supported by solvers “cp” and “mip”.
New in version 0.3.0.
outlier_detector (str or None, optional (default=None)) – The outlier detection method. Supported methods are “range” to use the interquartile range based method or “zscore” to use the modified Z-score method.
outlier_params (dict or None, optional (default=None)) – Dictionary of parameters to pass to the outlier detection method.
class_weight (dict, "balanced" or None, optional (default=None)) –
Weights associated with classes in the form {class_label: weight}. If None, all classes are supposed to have weight one. Check sklearn.tree.DecisionTreeClassifier.
cat_cutoff (float or None, optional (default=None)) – Generate an “others” bin containing the categories whose fraction of occurrences is below the cat_cutoff value. This option is available when dtype is “categorical”.
cat_unknown (float, str or None (default=None)) –
The value assigned to categories that were not observed during training but occur during transform.
If None, the value assigned to an unknown category follows this rule:
if transform metric == ‘woe’ then woe(mean event rate) = 0
if transform metric == ‘event_rate’ then mean event rate
if transform metric == ‘indices’ then -1
if transform metric == ‘bins’ then ‘unknown’
New in version 0.17.1.
user_splits (array-like or None, optional (default=None)) – The list of pre-binning split points when dtype is “numerical” or the list of prebins when dtype is “categorical”.
user_splits_fixed (array-like or None (default=None)) –
The list of pre-binning split points that must be fixed.
New in version 0.5.0.
special_codes (array-like, dict or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.
mip_solver (str, optional (default="bop")) – The mixed-integer programming solver. Supported solvers are “bop” to choose the Google OR-Tools binary optimizer or “cbc” to choose the COIN-OR Branch-and-Cut solver CBC.
time_limit (int (default=100)) – The maximum time in seconds to run the optimization solver.
verbose (bool (default=False)) – Enable verbose output.
**prebinning_kwargs (keyword arguments) –
The pre-binning keyword arguments.
New in version 0.6.1.
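The max_pvalue and max_pvalue_policy options above can be illustrated with a small standard-library sketch. It assumes a pooled, two-sided two-proportion Z-test on the event rates of two bins; the exact test variant used by the library is not stated here, and the helper names two_proportion_pvalue and violating_pairs are hypothetical.

```python
import math


def two_proportion_pvalue(events_a, total_a, events_b, total_b):
    """Two-sided p-value of a pooled two-proportion Z-test (assumed variant)."""
    rate_a = events_a / total_a
    rate_b = events_b / total_b
    pooled = (events_a + events_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # Both bins have the same degenerate event rate: no evidence of a difference.
        return 1.0
    z = (rate_a - rate_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def violating_pairs(bins, max_pvalue, policy="consecutive"):
    """Return index pairs of bins whose p-value exceeds max_pvalue.

    bins is a list of (n_event, n_records) tuples, one per bin.
    """
    n = len(bins)
    if policy == "consecutive":
        pairs = [(i, i + 1) for i in range(n - 1)]
    elif policy == "all":
        pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    else:
        raise ValueError("policy must be 'consecutive' or 'all'")
    return [
        (i, j)
        for i, j in pairs
        if two_proportion_pvalue(*bins[i], *bins[j]) > max_pvalue
    ]


# Three bins given as (events, records); the first and last have similar event rates.
bins = [(10, 100), (30, 100), (12, 100)]
consecutive = violating_pairs(bins, max_pvalue=0.05, policy="consecutive")
all_pairs = violating_pairs(bins, max_pvalue=0.05, policy="all")
```

In this toy data every consecutive pair differs significantly, so only the “all” policy flags the first and last bins as too similar.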
Notes
The parameter values max_n_prebins and min_prebin_size control complexity and memory usage. The default values generally produce quality results; however, some improvement can be achieved by increasing max_n_prebins and/or decreasing min_prebin_size. A value of max_n_prebins greater than 100 is only recommended if solver="ls".
The pre-binning refinement phase guarantees that no prebin has zero counts of non-events or events by merging those pure prebins. Pure bins produce infinite WoE and IV measures.
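To see why pure bins are a problem, the following standard-library sketch computes WoE and IV from per-bin counts. It assumes the convention WoE = log(share of non-events / share of events) and IV as the sum of (share of non-events - share of events) * WoE; the function name woe_and_iv is hypothetical, and both classes are assumed to be present in the data.

```python
import math


def woe_and_iv(bins):
    """Compute per-bin WoE and total IV.

    bins is a list of (n_nonevent, n_event) counts, one tuple per bin.
    """
    total_nonevent = sum(n_nonevent for n_nonevent, _ in bins)
    total_event = sum(n_event for _, n_event in bins)
    woes = []
    iv = 0.0
    for n_nonevent, n_event in bins:
        if n_nonevent == 0 or n_event == 0:
            # Pure bin: the log ratio diverges, so WoE and IV become infinite.
            woes.append(math.inf if n_event == 0 else -math.inf)
            iv = math.inf
            continue
        p_nonevent = n_nonevent / total_nonevent
        p_event = n_event / total_event
        woe = math.log(p_nonevent / p_event)
        woes.append(woe)
        iv += (p_nonevent - p_event) * woe
    return woes, iv


mixed_woes, mixed_iv = woe_and_iv([(80, 20), (20, 80)])
pure_woes, pure_iv = woe_and_iv([(10, 0), (90, 100)])
```

The second call contains a bin with no events, so its IV is infinite; merging such prebins with a neighbor keeps both measures finite.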
The mathematical formulation when solver="ls" does not currently support the max_pvalue constraint.
-
property binning_table
Return an instantiated binning table. Please refer to Binning table: binary target.
- Returns
binning_table
- Return type
-
fit(x, y, sample_weight=None, check_input=False)
Fit the optimal binning according to the given training data.
- Parameters
x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Target vector relative to x.
sample_weight (array-like of shape (n_samples,) (default=None)) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Only applied if prebinning_method="cart".
check_input (bool (default=False)) – Whether to check input arrays.
- Returns
self – Fitted optimal binning.
- Return type
-
fit_transform(x, y, sample_weight=None, metric='woe', metric_special=0, metric_missing=0, show_digits=2, check_input=False)
Fit the optimal binning according to the given training data, then transform it.
- Parameters
x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Target vector relative to x.
sample_weight (array-like of shape (n_samples,) (default=None)) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Only applied if prebinning_method="cart".
metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence, “event_rate” to choose the event rate, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate, and any numerical value.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.
- Returns
x_new – Transformed array.
- Return type
numpy array, shape = (n_samples,)
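The “indices” and “bins” metrics can be pictured with a short standard-library sketch that maps numerical values to bins given sorted split points. It assumes left-closed, right-open intervals and treats show_digits as the number of decimals in the label; both are assumptions made for illustration, the helper assign_bins is hypothetical, and missing values and special codes are not handled.

```python
from bisect import bisect_right


def assign_bins(values, splits, show_digits=2):
    """Map each value to a bin index and an interval label.

    splits must be sorted in ascending order.
    """
    edges = [float("-inf")] + list(splits) + [float("inf")]
    indices = [bisect_right(splits, value) for value in values]
    labels = []
    for i in indices:
        # The first interval is open on the left because its edge is -inf.
        left_bracket = "(" if i == 0 else "["
        labels.append(
            f"{left_bracket}{edges[i]:.{show_digits}f}, {edges[i + 1]:.{show_digits}f})"
        )
    return indices, labels


indices, labels = assign_bins([1.0, 2.5, 7.0], splits=[2.5, 5.0])
```

A value equal to a split point falls into the bin on its right under this assumption, which is why 2.5 gets index 1.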
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
-
information(print_level=1)
Print overview information about the option settings, problem statistics, and the solution of the computation.
- Parameters
print_level (int (default=1)) – Level of details.
-
read_json(path)
Read a JSON file containing split points and set them as the new split points.
- Parameters
path – The path of the JSON file.
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
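The <component>__<parameter> naming convention can be sketched with a toy class. This is not scikit-learn's implementation; the Component class and the parameter names used below are hypothetical and only show how a double underscore routes a value to a nested object.

```python
class Component:
    """Toy estimator-like object that stores its parameters in a dict."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        for key, value in params.items():
            name, separator, sub_key = key.partition("__")
            if separator:
                # Nested form: forward the remainder of the key to the sub-component.
                self.params[name].set_params(**{sub_key: value})
            else:
                self.params[name] = value
        # Returning self allows chained calls, as estimators usually do.
        return self


inner = Component(max_n_bins=5)
outer = Component(binning=inner, verbose=False)
result = outer.set_params(binning__max_n_bins=3, verbose=True)
```

After the call, the nested component holds max_n_bins=3 while the outer object only changed its own verbose flag.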
-
property splits
List of optimal split points when dtype is set to “numerical” or list of optimal bins when dtype is set to “categorical”.
- Returns
splits
- Return type
numpy.ndarray
-
property status
The status of the underlying optimization solver.
- Returns
status
- Return type
str
-
to_json(path)
Save optimal bins and/or split points and transformation depending on the target type.
- Parameters
path – The path where the JSON file is going to be saved.
-
transform(x, metric='woe', metric_special=0, metric_missing=0, show_digits=2, check_input=False)
Transform given data to Weight of Evidence (WoE) or event rate using bins from the fitted optimal binning.
- Parameters
x (array-like, shape = (n_samples,)) – Training vector, where n_samples is the number of samples.
metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence, “event_rate” to choose the event rate, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.
- Returns
x_new – Transformed array.
- Return type
numpy array, shape = (n_samples,)
Notes
Transforming data that includes categories not present during training returns zero WoE or the mean event rate for those categories.
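The fallback for unseen categories can be sketched as a dictionary lookup with a metric-dependent default, following the rule listed for cat_unknown. The helper names, the fitted mappings, and the mean event rate below are hypothetical illustration data, not output of the library.

```python
def unknown_default(metric, mean_event_rate):
    """Default value for a category not seen during training."""
    defaults = {
        "woe": 0.0,                     # WoE of the mean event rate is zero
        "event_rate": mean_event_rate,  # fall back to the overall event rate
        "indices": -1,
        "bins": "unknown",
    }
    return defaults[metric]


def transform_categories(values, fitted, metric, mean_event_rate, cat_unknown=None):
    """Map categories to metric values, using a fallback for unknown ones."""
    if cat_unknown is None:
        fallback = unknown_default(metric, mean_event_rate)
    else:
        # A user-supplied value overrides the metric-dependent default.
        fallback = cat_unknown
    return [fitted.get(value, fallback) for value in values]


fitted_woe = {"a": 0.4, "b": -0.3}
fitted_indices = {"a": 0, "b": 1}
woe_values = transform_categories(["a", "z"], fitted_woe, "woe", mean_event_rate=0.2)
index_values = transform_categories(["b", "z"], fitted_indices, "indices", mean_event_rate=0.2)
```

Category "z" was never fitted, so it maps to 0.0 for “woe” and to -1 for “indices”.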