Optimal binning 2D with binary target

class optbinning.OptimalBinning2D(name_x='', name_y='', dtype_x='numerical', dtype_y='numerical', prebinning_method='cart', strategy='grid', solver='cp', divergence='iv', max_n_prebins_x=5, max_n_prebins_y=5, min_prebin_size_x=0.05, min_prebin_size_y=0.05, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, min_bin_n_nonevent=None, max_bin_n_nonevent=None, min_bin_n_event=None, max_bin_n_event=None, monotonic_trend_x=None, monotonic_trend_y=None, min_event_rate_diff_x=0, min_event_rate_diff_y=0, gamma=0, special_codes_x=None, special_codes_y=None, split_digits=None, n_jobs=1, time_limit=100, verbose=False)

Bases: optbinning.binning.binning.OptimalBinning

Optimal binning of two numerical variables with respect to a binary target.

Parameters
  • name_x (str, optional (default="")) – The name of variable x.

  • name_y (str, optional (default="")) – The name of variable y.

  • dtype_x (str, optional (default="numerical")) – The data type of variable x. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.

  • dtype_y (str, optional (default="numerical")) – The data type of variable y. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.

  • prebinning_method (str, optional (default="cart")) – The pre-binning method. Supported methods are “cart” for a CART decision tree, “mdlp” for Minimum Description Length Principle (MDLP), “quantile” to generate prebins with approximately the same frequency and “uniform” to generate prebins with equal width. Method “cart” uses sklearn.tree.DecisionTreeClassifier.

  • strategy (str, optional (default="grid")) – The strategy used to create the initial 2D prebinning after computing the prebinning splits on the x and y axes. The strategy “grid” creates a 2D prebinning with n_prebins_x times n_prebins_y elements. The strategy “cart” (experimental) reduces the number of elements by pruning, and is recommended when the number of prebins is large.

  • solver (str, optional (default="cp")) – The optimizer to solve the optimal binning problem. Supported solvers are “mip” to choose a mixed-integer programming solver, and “cp” to choose a constraint programming solver.

  • divergence (str, optional (default="iv")) – The divergence measure in the objective function to be maximized. Supported divergences are “iv” (Information Value or Jeffreys divergence), “js” (Jensen-Shannon), “hellinger” (Hellinger divergence) and “triangular” (triangular discrimination).

  • max_n_prebins_x (int (default=5)) – The maximum number of bins on variable x after pre-binning (prebins).

  • max_n_prebins_y (int (default=5)) – The maximum number of bins on variable y after pre-binning (prebins).

  • min_prebin_size_x (float (default=0.05)) – The minimum fraction of records for each prebin on variable x.

  • min_prebin_size_y (float (default=0.05)) – The minimum fraction of records for each prebin on variable y.

  • min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].

  • max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].

  • min_bin_size (float or None, optional (default=None)) – The minimum fraction of records for each bin. If None, min_bin_size = min_prebin_size.

  • max_bin_size (float or None, optional (default=None)) – The maximum fraction of records for each bin. If None, max_bin_size = 1.0.

  • min_bin_n_nonevent (int or None, optional (default=None)) – The minimum number of non-event records for each bin. If None, min_bin_n_nonevent = 1.

  • max_bin_n_nonevent (int or None, optional (default=None)) – The maximum number of non-event records for each bin. If None, then an unlimited number of non-event records for each bin.

  • min_bin_n_event (int or None, optional (default=None)) – The minimum number of event records for each bin. If None, min_bin_n_event = 1.

  • max_bin_n_event (int or None, optional (default=None)) – The maximum number of event records for each bin. If None, then an unlimited number of event records for each bin.

  • monotonic_trend_x (str or None, optional (default=None)) – The event rate monotonic trend on the x axis. Supported trends are “ascending” and “descending”. If None, then the monotonic constraint is disabled.

  • monotonic_trend_y (str or None, optional (default=None)) – The event rate monotonic trend on the y axis. Supported trends are “ascending” and “descending”. If None, then the monotonic constraint is disabled.

  • min_event_rate_diff_x (float, optional (default=0)) – The minimum event rate difference between consecutive bins on the x axis.

  • min_event_rate_diff_y (float, optional (default=0)) – The minimum event rate difference between consecutive bins on the y axis.

  • gamma (float, optional (default=0)) – Regularization strength to reduce the number of dominating bins. Larger values specify stronger regularization.

  • special_codes_x (array-like or None, optional (default=None)) – List of special codes for the variable x. Use special codes to specify the data values that must be treated separately.

  • special_codes_y (array-like or None, optional (default=None)) – List of special codes for the variable y. Use special codes to specify the data values that must be treated separately.

  • split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.

  • n_jobs (int or None, optional (default=1)) – Number of cores to run in parallel while binning variables. None means 1 core. -1 means using all processors.

  • time_limit (int (default=100)) – The maximum time in seconds to run the optimization solver.

  • verbose (bool (default=False)) – Enable verbose output.
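The divergence parameter above names four measures. The following is a minimal sketch of their textbook definitions in plain Python, not the library's implementation; the function name divergences and the inputs p and q (the shares of non-events and events falling in each bin) are assumptions made for illustration, and constant factors may differ from those used internally.

```python
import math


def divergences(p, q):
    """Return the four divergences between two discrete distributions.

    p[i] and q[i] are the shares of non-events and events in bin i.
    Both lists must sum to 1 and contain only positive values.
    """
    pairs = list(zip(p, q))
    # Information Value (Jeffreys divergence).
    iv = sum((pi - qi) * math.log(pi / qi) for pi, qi in pairs)
    # Jensen-Shannon divergence.
    js = 0.5 * sum(
        pi * math.log(2 * pi / (pi + qi)) + qi * math.log(2 * qi / (pi + qi))
        for pi, qi in pairs
    )
    # Hellinger divergence (squared Hellinger distance).
    hellinger = 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in pairs)
    # Triangular discrimination.
    triangular = sum((pi - qi) ** 2 / (pi + qi) for pi, qi in pairs)
    return {"iv": iv, "js": js, "hellinger": hellinger, "triangular": triangular}


# Two bins: non-events split 50/50, events split 25/75.
result = divergences([0.5, 0.5], [0.25, 0.75])
```

All four measures are zero when the two distributions coincide and grow as the bins separate events from non-events more sharply, which is why the objective maximizes them.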

property binning_table

Return an instantiated binning table. Please refer to Binning table: binary target.

Returns

binning_table

Return type

BinningTable
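As a rough illustration of the per-bin statistics such a table reports, the sketch below derives the record count, event rate and Weight of Evidence from raw counts. The helper name and the input layout are hypothetical, and the sign convention (logarithm of the non-event share over the event share) is an assumption to verify against the library.

```python
import math


def binning_table_rows(counts):
    """Build summary rows from a list of (n_nonevent, n_event) per bin.

    Every count must be strictly positive so the logarithm is defined.
    """
    total_nonevent = sum(ne for ne, _ in counts)
    total_event = sum(e for _, e in counts)
    rows = []
    for ne, e in counts:
        # Assumed convention: WoE = ln(share of non-events / share of events).
        woe = math.log((ne / total_nonevent) / (e / total_event))
        rows.append({"count": ne + e, "event_rate": e / (ne + e), "woe": woe})
    return rows


# Two bins with (non-events, events) = (80, 20) and (60, 40).
rows = binning_table_rows([(80, 20), (60, 40)])
```

Under this convention, a bin whose event rate is below the overall event rate gets a positive WoE, and a bin above it gets a negative one.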

fit(x, y, z, check_input=False)

Fit the optimal binning 2D according to the given training data.

Parameters
  • x (array-like, shape = (n_samples,)) – Training vector x, where n_samples is the number of samples.

  • y (array-like, shape = (n_samples,)) – Training vector y, where n_samples is the number of samples.

  • z (array-like, shape = (n_samples,)) – Target vector relative to x and y.

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

self – Fitted optimal binning 2D.

Return type

OptimalBinning2D
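To make the strategy="grid" pre-binning concrete, here is a minimal sketch of how per-axis split points partition the records into n_prebins_x times n_prebins_y cells and how events and non-events are tallied per cell. It is a plain-Python illustration, not the library's code; grid_counts and its arguments are hypothetical names, and treating a value equal to a split point as belonging to the bin on its right is an assumption.

```python
from bisect import bisect_right


def grid_counts(x, y, z, splits_x, splits_y):
    """Count non-events and events in each cell of the x-by-y prebin grid."""
    n_x, n_y = len(splits_x) + 1, len(splits_y) + 1
    # counts[i][j] = [n_nonevent, n_event] for x-prebin i and y-prebin j.
    counts = [[[0, 0] for _ in range(n_y)] for _ in range(n_x)]
    for xi, yi, zi in zip(x, y, z):
        i = bisect_right(splits_x, xi)  # index of the x-prebin containing xi
        j = bisect_right(splits_y, yi)  # index of the y-prebin containing yi
        counts[i][j][zi] += 1           # zi is 0 (non-event) or 1 (event)
    return counts


x = [1.0, 2.5, 4.0, 6.0]
y = [0.1, 0.9, 0.4, 0.8]
z = [0, 1, 0, 1]
counts = grid_counts(x, y, z, splits_x=[2.0, 5.0], splits_y=[0.5])
```

Two splits on x and one on y give 3 × 2 = 6 cells; the optimizer then decides how to merge such cells into the final bins.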

fit_transform(x, y, z, metric='woe', metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Fit the optimal binning 2D according to the given training data, then transform it.

Parameters
  • x (array-like, shape = (n_samples,)) – Training vector x, where n_samples is the number of samples.

  • y (array-like, shape = (n_samples,)) – Training vector y, where n_samples is the number of samples.

  • z (array-like, shape = (n_samples,)) – Target vector relative to x and y.

  • metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence, “event_rate” to choose the event rate, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.

  • metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate, and any numerical value.

  • metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.

  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

z_new – Transformed array.

Return type

numpy array, shape = (n_samples,)
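The interplay of metric_special and metric_missing can be sketched as follows. This is a simplified single-variable illustration under stated assumptions, not the library's logic: the function names, the regular callable, and the empirical_special and empirical_missing values (standing in for statistics measured during fitting) are all hypothetical.

```python
import math


def resolve(setting, empirical_value):
    """Return the number emitted for a metric_special/metric_missing setting."""
    return empirical_value if setting == "empirical" else float(setting)


def transform_with_codes(values, regular, special_codes,
                         metric_special=0, metric_missing=0,
                         empirical_special=0.0, empirical_missing=0.0):
    """Apply `regular` to ordinary values and fixed values to the rest."""
    out = []
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            out.append(resolve(metric_missing, empirical_missing))
        elif v in special_codes:
            out.append(resolve(metric_special, empirical_special))
        else:
            out.append(regular(v))
    return out


# -999 is treated as a special code; NaN is treated as missing.
result = transform_with_codes(
    [1.0, -999, float("nan"), 3.0],
    regular=lambda v: 0.5 if v < 2 else -0.5,  # hypothetical WoE lookup
    special_codes={-999},
    metric_special="empirical",
    empirical_special=0.12,
)
```

With the defaults, missing values map to 0, while "empirical" substitutes the value observed for that group of records.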

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

information(print_level=1)

Print overview information about the options settings, problem statistics, and the solution of the computation.

Parameters

print_level (int (default=1)) – Level of details.

read_json(path)

Read a JSON file containing split points and set them as the new split points.

Parameters

path – The path of the JSON file.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance
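The <component>__<parameter> naming convention can be illustrated with a short sketch that separates an estimator's own parameters from those addressed to nested components. This is not the actual set_params implementation; the function name and the step name "binning" are hypothetical.

```python
def split_nested_params(params):
    """Group keyword arguments by component using the '__' separator."""
    own, nested = {}, {}
    for key, value in params.items():
        name, sep, sub_key = key.partition("__")
        if sep:
            # "binning__solver" targets parameter "solver" of component "binning".
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"max_n_prebins_x": 8, "binning__solver": "mip", "binning__gamma": 0.1}
)
```

Keys without the separator apply to the estimator itself; keys with it are forwarded to the named component.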

property splits

List of optimal split points and bins for axis x and y.

Returns

splits

Return type

(numpy.ndarray, numpy.ndarray)

property status

The status of the underlying optimization solver.

Returns

status

Return type

str

to_json(path)

Save optimal bins and/or split points and transformation depending on the target type.

Parameters

path – The path where the JSON file is going to be saved.

transform(x, y, metric='woe', metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Transform given data to Weight of Evidence (WoE) or event rate using bins from the fitted optimal binning 2D.

Parameters
  • x (array-like, shape = (n_samples,)) – Training vector x, where n_samples is the number of samples.

  • y (array-like, shape = (n_samples,)) – Training vector y, where n_samples is the number of samples.

  • metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence, “event_rate” to choose the event rate, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.

  • metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.

  • metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.

  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".

  • check_input (bool (default=False)) – Whether to check input arrays.

Returns

z_new – Transformed array.

Return type

numpy array, shape = (n_samples,)
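Conceptually, the transformation looks up the bin containing each (x, y) pair and returns the value stored for it. The sketch below assumes the simplest case, in which every grid cell is its own bin; the function name, the split points and the WoE values are hypothetical, and real fitted bins may be merged rectangles.

```python
from bisect import bisect_right


def transform_2d(x, y, splits_x, splits_y, cell_values):
    """Map each (x, y) pair to the value stored for its grid cell."""
    out = []
    for xi, yi in zip(x, y):
        i = bisect_right(splits_x, xi)  # row: bin index along x
        j = bisect_right(splits_y, yi)  # column: bin index along y
        out.append(cell_values[i][j])
    return out


# Hypothetical fitted result: 2 x-bins by 2 y-bins, one WoE value per cell.
woe = [[0.8, 0.1], [-0.2, -0.9]]
z_new = transform_2d([1.0, 3.0, 3.0], [0.2, 0.2, 0.7], [2.0], [0.5], woe)
```

Swapping the table of WoE values for event rates or bin indices gives the analogue of metric="event_rate" or metric="indices".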