Optimal binning 2D with continuous target

class optbinning.ContinuousOptimalBinning2D(name_x='', name_y='', dtype_x='numerical', dtype_y='numerical', prebinning_method='cart', strategy='grid', solver='cp', max_n_prebins_x=5, max_n_prebins_y=5, min_prebin_size_x=0.05, min_prebin_size_y=0.05, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, monotonic_trend_x=None, monotonic_trend_y=None, min_mean_diff_x=0, min_mean_diff_y=0, gamma=0, special_codes_x=None, special_codes_y=None, split_digits=None, n_jobs=1, time_limit=100, verbose=False)

Bases: optbinning.binning.multidimensional.binning_2d.OptimalBinning2D

Optimal binning of two numerical variables with respect to a continuous target.

Parameters

name_x (str, optional (default="")) – The name of variable x.
name_y (str, optional (default="")) – The name of variable y.
dtype_x (str, optional (default="numerical")) – The data type of variable x. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.
dtype_y (str, optional (default="numerical")) – The data type of variable y. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.
prebinning_method (str, optional (default="cart")) – The pre-binning method. Supported methods are “cart” for a CART decision tree, “mdlp” for Minimum Description Length Principle (MDLP), “quantile” to generate prebins with approximately the same frequency and “uniform” to generate prebins with equal width. Method “cart” uses sklearn.tree.DecisionTreeRegressor.
strategy (str, optional (default="grid")) – The strategy used to create the initial 2D prebinning after computing the prebinning splits on the x and y axes. The strategy “grid” creates a 2D prebinning with n_prebins_x times n_prebins_y elements. The strategy “cart” (experimental) reduces the number of elements by pruning. The latter is recommended when the number of prebins is large.
solver (str, optional (default="cp")) – The optimizer to solve the optimal binning problem. Supported solvers are “mip” to choose a mixed-integer programming solver, and “cp” to choose a constraint programming solver.
divergence (str, optional (default="iv")) – The divergence measure in the objective function to be maximized. Supported divergences are “iv” (Information Value or Jeffrey’s divergence), “js” (Jensen-Shannon), “hellinger” (Hellinger divergence) and “triangular” (triangular discrimination). Note that divergence does not appear in the constructor signature of this class.
max_n_prebins_x (int (default=5)) – The maximum number of bins on variable x after pre-binning (prebins).
max_n_prebins_y (int (default=5)) – The maximum number of bins on variable y after pre-binning (prebins).
min_prebin_size_x (float (default=0.05)) – The minimum fraction of records for each prebin on variable x.
min_prebin_size_y (float (default=0.05)) – The minimum fraction of records for each prebin on variable y.
min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].
max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].
min_bin_size (float or None, optional (default=None)) – The minimum fraction of records for each bin. If None, min_bin_size = min_prebin_size.
max_bin_size (float or None, optional (default=None)) – The maximum fraction of records for each bin. If None, max_bin_size = 1.0.
monotonic_trend_x (str or None, optional (default=None)) – The monotonic trend of the mean on the x axis. Supported trends are “ascending” and “descending”. If None, then the monotonic constraint is disabled.
monotonic_trend_y (str or None, optional (default=None)) – The monotonic trend of the mean on the y axis. Supported trends are “ascending” and “descending”. If None, then the monotonic constraint is disabled.
min_mean_diff_x (float, optional (default=0)) – The minimum mean difference between consecutive bins on the x axis.
min_mean_diff_y (float, optional (default=0)) – The minimum mean difference between consecutive bins on the y axis.
gamma (float, optional (default=0)) – Regularization strength to reduce the number of dominating bins. Larger values specify stronger regularization.
special_codes_x (array-like or None, optional (default=None)) – List of special codes for the variable x. Use special codes to specify the data values that must be treated separately.
special_codes_y (array-like or None, optional (default=None)) – List of special codes for the variable y. Use special codes to specify the data values that must be treated separately.
split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.
n_jobs (int or None, optional (default=1)) – Number of cores to run in parallel while binning variables. None means 1 core. -1 means using all processors.
time_limit (int (default=100)) – The maximum time in seconds to run the optimization solver.
verbose (bool (default=False)) – Enable verbose output.
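
To make the strategy="grid", monotonic_trend_x and min_mean_diff_x descriptions more concrete, here is a minimal pure-Python sketch. It is not optbinning's implementation: it assumes the split points on each axis are already known, and it skips the optimization step that selects the final bins. The data, the split points and the helper names grid_cell_means and is_ascending_on_x are hypothetical, and the row-by-row check is only one possible reading of an ascending trend on the x axis.

```python
from bisect import bisect_right
from statistics import mean

# Hypothetical split points, as a prebinning method might produce them.
splits_x = [2.0, 4.0]   # 3 prebins on x
splits_y = [10.0]       # 2 prebins on y

# Toy records (x, y, z) with a continuous target z.
records = [
    (1.0, 5.0, 1.0), (1.5, 12.0, 2.0),
    (3.0, 5.0, 3.0), (3.5, 12.0, 4.0),
    (5.0, 5.0, 5.0), (5.5, 12.0, 6.0), (6.0, 12.0, 8.0),
]


def grid_cell_means(records, splits_x, splits_y):
    """Return {(i, j): mean of z} for every non-empty grid cell."""
    cells = {}
    for x, y, z in records:
        i = bisect_right(splits_x, x)  # index of the x prebin
        j = bisect_right(splits_y, y)  # index of the y prebin
        cells.setdefault((i, j), []).append(z)
    return {cell: mean(zs) for cell, zs in cells.items()}


def is_ascending_on_x(cell_means, n_x, n_y, min_mean_diff=0.0):
    """Check that, within each y prebin, the mean grows along x
    by at least min_mean_diff between consecutive cells."""
    for j in range(n_y):
        row = [cell_means[(i, j)] for i in range(n_x) if (i, j) in cell_means]
        if any(b - a < min_mean_diff for a, b in zip(row, row[1:])):
            return False
    return True


n_x, n_y = len(splits_x) + 1, len(splits_y) + 1
cell_means = grid_cell_means(records, splits_x, splits_y)
print(n_x * n_y)                                   # size of the grid
print(is_ascending_on_x(cell_means, n_x, n_y))     # trend check per row
```

Raising min_mean_diff in the check makes the trend requirement stricter, which is the role the min_mean_diff_x and min_mean_diff_y parameters play in the description above.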

property binning_table

Return an instantiated binning table. Please refer to Binning table: binary target.

Returns: binning_table
Return type: BinningTable

fit(x, y, z, check_input=False)

Fit the optimal binning 2D according to the given training data.

Parameters

x (array-like, shape = (n_samples,)) – Training vector x, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Training vector y, where n_samples is the number of samples.
z (array-like, shape = (n_samples,)) – Target vector relative to x and y.
check_input (bool (default=False)) – Whether to check input arrays.

Returns

self – Fitted optimal binning 2D.

Return type

ContinuousOptimalBinning2D

fit_transform(x, y, z, metric='mean', metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Fit the optimal binning 2D according to the given training data, then transform it.

Parameters

x (array-like, shape = (n_samples,)) – Training vector x, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Training vector y, where n_samples is the number of samples.
z (array-like, shape = (n_samples,)) – Target vector relative to x and y.
metric (str (default="mean")) – The metric used to transform the input vector. Supported metrics are “mean” to choose the mean, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical mean, and any numerical value.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical mean, and any numerical value.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.

Returns

z_new – Transformed array.

Return type

numpy array, shape = (n_samples,)

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

information(print_level=1)

Print overview information about the options settings, problem statistics, and the solution of the computation.

Parameters: print_level (int (default=1)) – Level of details.

read_json(path)

Read a JSON file containing split points and set them as the new split points.

Parameters: path – The path of the JSON file.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance
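
The <component>__<parameter> naming convention can be illustrated with a short sketch. This is not scikit-learn's or optbinning's code. It only shows how a parameter dictionary separates into the estimator's own parameters and those addressed to a named component. The helper split_nested_params and the component name "binning" are hypothetical.

```python
def split_nested_params(params):
    """Separate plain parameters from `<component>__<parameter>` ones."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only.
        name, delim, sub_key = key.partition("__")
        if delim:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


own, nested = split_nested_params({
    "max_n_prebins_x": 10,      # plain parameter of the estimator itself
    "binning__time_limit": 30,  # parameter of a component named "binning"
})
print(own)
print(nested)
```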

property splits

List of optimal split points and bins for axis x and y.

Returns: splits
Return type: (numpy.ndarray, numpy.ndarray)

property status

The status of the underlying optimization solver.

Returns: status
Return type: str

to_dict()

Convert the optimal bins and/or split points, together with the transformation that depends on the target type, to a dictionary.

Returns: opt_bin_dict
Return type: dict

to_json(path)

Save the optimal bins and/or split points, together with the transformation that depends on the target type, to a JSON file.

Parameters: path – The path where the JSON file is going to be saved.

transform(x, y, metric='mean', metric_special=0, metric_missing=0, show_digits=2, check_input=False)

Transform the given data using the bins from the fitted optimal binning 2D. With the default metric, each sample is mapped to the mean of its bin.

Parameters

x (array-like, shape = (n_samples,)) – Training vector x, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Training vector y, where n_samples is the number of samples.
metric (str (default="mean")) – The metric used to transform the input vector. Supported metrics are “mean” to choose the mean, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical mean, and any numerical value.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical mean, and any numerical value.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.

Returns

z_new – Transformed array.

Return type

numpy array, shape = (n_samples,)
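
The three metric options, along with metric_special and metric_missing, can be sketched in pure Python as follows. This is an illustration under simplifying assumptions, not the library's transform: the bins are plain grid cells, the "fitted" split points and cell means are hard-coded, the row-major index numbering and the interval label format are made up for the example, and the "empirical" option is left out. The names transform_sketch, interval, SPLITS_X, SPLITS_Y and CELL_MEAN are hypothetical.

```python
from bisect import bisect_right
from math import inf, isnan

# Hypothetical fitted state: split points per axis and mean target per cell.
SPLITS_X = [2.0, 4.0]
SPLITS_Y = [10.0]
CELL_MEAN = {
    (0, 0): 1.0, (1, 0): 3.0, (2, 0): 5.0,
    (0, 1): 2.0, (1, 1): 4.0, (2, 1): 7.0,
}


def interval(splits, k):
    """Format the k-th interval delimited by sorted split points."""
    edges = [-inf, *splits, inf]
    left = "(" if k == 0 else "["
    return f"{left}{edges[k]}, {edges[k + 1]})"


def transform_sketch(xs, ys, metric="mean", metric_special=0,
                     metric_missing=0, special_codes_x=(), special_codes_y=()):
    """Map each (x, y) pair to the mean, index or label of its grid cell."""
    n_x = len(SPLITS_X) + 1
    out = []
    for x, y in zip(xs, ys):
        if isnan(x) or isnan(y):
            out.append(metric_missing)   # missing values get a fixed value
            continue
        if x in special_codes_x or y in special_codes_y:
            out.append(metric_special)   # special codes get a fixed value
            continue
        i = bisect_right(SPLITS_X, x)
        j = bisect_right(SPLITS_Y, y)
        if metric == "mean":
            out.append(CELL_MEAN[(i, j)])
        elif metric == "indices":
            out.append(j * n_x + i)      # row-major numbering (an assumption)
        elif metric == "bins":
            out.append(f"{interval(SPLITS_X, i)} x {interval(SPLITS_Y, j)}")
        else:
            raise ValueError(f"unsupported metric: {metric!r}")
    return out


xs = [1.0, 3.0, float("nan"), -999.0]
ys = [5.0, 12.0, 1.0, 5.0]
print(transform_sketch(xs, ys, special_codes_x=(-999.0,)))
print(transform_sketch(xs[:2], ys[:2], metric="indices"))
print(transform_sketch(xs[:2], ys[:2], metric="bins"))
```

In every case the result has one entry per input sample, matching the documented return shape (n_samples,).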