Optimal binning 2D with continuous target
-
class
optbinning.
ContinuousOptimalBinning2D
(name_x='', name_y='', dtype_x='numerical', dtype_y='numerical', prebinning_method='cart', strategy='grid', solver='cp', max_n_prebins_x=5, max_n_prebins_y=5, min_prebin_size_x=0.05, min_prebin_size_y=0.05, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, monotonic_trend_x=None, monotonic_trend_y=None, min_mean_diff_x=0, min_mean_diff_y=0, gamma=0, special_codes_x=None, special_codes_y=None, split_digits=None, n_jobs=1, time_limit=100, verbose=False)
Bases:
optbinning.binning.multidimensional.binning_2d.OptimalBinning2D
Optimal binning of two numerical variables with respect to a continuous target.
- Parameters
name_x (str, optional (default="")) – The name of variable x.
name_y (str, optional (default="")) – The name of variable y.
dtype_x (str, optional (default="numerical")) – The data type of variable x. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.
dtype_y (str, optional (default="numerical")) – The data type of variable y. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.
prebinning_method (str, optional (default="cart")) – The pre-binning method. Supported methods are “cart” for a CART decision tree, “mdlp” for Minimum Description Length Principle (MDLP), “quantile” to generate prebins with approximately same frequency and “uniform” to generate prebins with equal width. Method “cart” uses sklearn.tree.DecisionTreeRegressor.
strategy (str, optional (default="grid")) – The strategy used to create the initial prebinning 2D after computing prebinning splits on the x and y axis. The strategy “grid” creates a prebinning 2D with n_prebins_x times n_prebins_y elements. The strategy “cart” (experimental) reduces the number of elements by pruning. The latter is recommended when the number of prebins is large.
solver (str, optional (default="cp")) – The optimizer to solve the optimal binning problem. Supported solvers are “mip” to choose a mixed-integer programming solver, and “cp” to choose a constraint programming solver.
divergence (str, optional (default="iv")) – The divergence measure in the objective function to be maximized. Supported divergences are “iv” (Information Value or Jeffrey’s divergence), “js” (Jensen-Shannon), “hellinger” (Hellinger divergence) and “triangular” (triangular discrimination).
max_n_prebins_x (int (default=5)) – The maximum number of bins on variable x after pre-binning (prebins).
max_n_prebins_y (int (default=5)) – The maximum number of bins on variable y after pre-binning (prebins).
min_prebin_size_x (float (default=0.05)) – The fraction of minimum number of records for each prebin on variable x.
min_prebin_size_y (float (default=0.05)) – The fraction of minimum number of records for each prebin on variable y.
min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].
max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].
min_bin_size (float or None, optional (default=None)) – The fraction of minimum number of records for each bin. If None, min_bin_size = min_prebin_size.
max_bin_size (float or None, optional (default=None)) – The fraction of maximum number of records for each bin. If None, max_bin_size = 1.0.
monotonic_trend_x (str or None, optional (default=None)) – The mean monotonic trend on the x axis. Supported trends are “ascending”, and “descending”. If None, then the monotonic constraint is disabled.
monotonic_trend_y (str or None, optional (default=None)) – The mean monotonic trend on the y axis. Supported trends are “ascending”, and “descending”. If None, then the monotonic constraint is disabled.
min_mean_diff_x (float, optional (default=0)) – The minimum mean difference between consecutive bins on the x axis.
min_mean_diff_y (float, optional (default=0)) – The minimum mean difference between consecutive bins on the y axis.
gamma (float, optional (default=0)) – Regularization strength to reduce the number of dominating bins. Larger values specify stronger regularization.
special_codes_x (array-like or None, optional (default=None)) – List of special codes for the variable x. Use special codes to specify the data values that must be treated separately.
special_codes_y (array-like or None, optional (default=None)) – List of special codes for the variable y. Use special codes to specify the data values that must be treated separately.
split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.
n_jobs (int or None, optional (default=1)) – Number of cores to run in parallel while binning variables. None means 1 core. -1 means using all processors.
time_limit (int (default=100)) – The maximum time in seconds to run the optimization solver.
verbose (bool (default=False)) – Enable verbose output.
-
property
binning_table
Return an instantiated binning table. Please refer to Binning table: binary target.
- Returns
binning_table
- Return type
-
fit
(x, y, z, check_input=False)
Fit the optimal binning 2D according to the given training data.
- Parameters
x (array-like, shape = (n_samples,)) – Training vector x, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Training vector y, where n_samples is the number of samples.
z (array-like, shape = (n_samples,)) – Target vector relative to x and y.
check_input (bool (default=False)) – Whether to check input arrays.
- Returns
self – Fitted optimal binning 2D.
- Return type
-
fit_transform
(x, y, z, metric='mean', metric_special=0, metric_missing=0, show_digits=2, check_input=False)
Fit the optimal binning 2D according to the given training data, then transform it.
- Parameters
x (array-like, shape = (n_samples,)) – Training vector x, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Training vector y, where n_samples is the number of samples.
z (array-like, shape = (n_samples,)) – Target vector relative to x and y.
metric (str (default="mean")) – The metric used to transform the input vector. Supported metrics are “mean” to choose the mean, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate, and any numerical value.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.
- Returns
z_new – Transformed array.
- Return type
numpy array, shape = (n_samples,)
-
get_params
(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
-
information
(print_level=1)
Print overview information about the options settings, problem statistics, and the solution of the computation.
- Parameters
print_level (int (default=1)) – Level of details.
-
read_json
(path)
Read json file containing split points and set them as the new split points.
- Parameters
path – The path of the json file.
-
set_params
(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
-
property
splits
List of optimal split points and bins for axis x and y.
- Returns
splits
- Return type
(numpy.ndarray, numpy.ndarray)
-
property
status
The status of the underlying optimization solver.
- Returns
status
- Return type
str
-
to_json
(path)
Save optimal bins and/or split points and transformation depending on the target type.
- Parameters
path – The path where the json is going to be saved.
-
transform
(x, y, metric='mean', metric_special=0, metric_missing=0, show_digits=2, check_input=False)
Transform given data to mean using bins from the fitted optimal binning 2D.
- Parameters
x (array-like, shape = (n_samples,)) – Training vector x, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Training vector y, where n_samples is the number of samples.
metric (str (default="mean")) – The metric used to transform the input vector. Supported metrics are “mean” to choose the mean, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.
- Returns
z_new – Transformed array.
- Return type
numpy array, shape = (n_samples,)
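To make the “mean” and “indices” metrics concrete, here is a conceptual sketch in plain Python. It is not the library's implementation: the split points, the cell-to-bin map, the bin means, the helper name transform_sketch and the choice of half-open intervals are all hypothetical values and assumptions chosen for illustration.

```python
from bisect import bisect_right

# Hypothetical fitted result: split points per axis.
splits_x = [2.0, 5.0]   # x intervals: (-inf, 2), [2, 5), [5, inf)
splits_y = [1.0]        # y intervals: (-inf, 1), [1, inf)

# bin_index[i][j] is the bin that grid cell (i, j) belongs to.
# Merged cells share an index, and each bin is a rectangle of cells.
bin_index = [
    [0, 1],
    [0, 2],
    [3, 3],
]
# Hypothetical mean of the target in each bin.
bin_mean = [1.5, 4.0, 6.5, 9.0]


def transform_sketch(x, y, metric="mean"):
    """Map each (x, y) pair to its bin mean or bin index."""
    if metric not in ("mean", "indices"):
        raise ValueError(f"unsupported metric: {metric}")
    out = []
    for xi, yi in zip(x, y):
        i = bisect_right(splits_x, xi)  # interval position on the x axis
        j = bisect_right(splits_y, yi)  # interval position on the y axis
        k = bin_index[i][j]
        out.append(bin_mean[k] if metric == "mean" else k)
    return out


print(transform_sketch([1.0, 3.0, 7.0], [0.5, 2.0, 0.2]))                    # [1.5, 6.5, 9.0]
print(transform_sketch([1.0, 3.0, 7.0], [0.5, 2.0, 0.2], metric="indices"))  # [0, 2, 3]
```

The “bins” metric would return a description of the bin interval instead of a number. It is left out of this sketch because its exact string format is not specified here.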