Binning process
class optbinning.BinningProcess(variable_names, max_n_prebins=20, min_prebin_size=0.05, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, max_pvalue=None, max_pvalue_policy='consecutive', selection_criteria=None, fixed_variables=None, categorical_variables=None, special_codes=None, split_digits=None, binning_fit_params=None, binning_transform_params=None, n_jobs=None, verbose=False)
Bases: optbinning.binning.base.Base, sklearn.base.BaseEstimator, optbinning.binning.binning_process.BaseBinningProcess
Binning process to compute optimal binning of variables in a dataset, given a binary, continuous or multiclass target dtype.
- Parameters
variable_names (array-like) – List of variable names.
max_n_prebins (int (default=20)) – The maximum number of bins after pre-binning (prebins).
min_prebin_size (float (default=0.05)) – The fraction of minimum number of records for each prebin.
min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].
max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].
min_bin_size (float or None, optional (default=None)) – The fraction of minimum number of records for each bin. If None, min_bin_size = min_prebin_size.
max_bin_size (float or None, optional (default=None)) – The fraction of maximum number of records for each bin. If None, max_bin_size = 1.0.
max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins.
max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.
selection_criteria (dict or None (default=None)) –
Variable selection criteria. See notes.
New in version 0.6.0.
fixed_variables (array-like or None) –
List of variables to be fixed. The binning process retains these variables even if they do not satisfy the selection criteria.
New in version 0.12.1.
special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.
categorical_variables (array-like or None, optional (default=None)) – List of numerical variables to be considered categorical. These are nominal variables. Not applicable when target type is multiclass.
binning_fit_params (dict or None, optional (default=None)) – Dictionary with optimal binning fitting options for specific variables. Example: {"variable_1": {"max_n_bins": 4}}.
binning_transform_params (dict or None, optional (default=None)) – Dictionary with optimal binning transform options for specific variables. Example: {"variable_1": {"metric": "event_rate"}}.
n_jobs (int or None, optional (default=None)) –
Number of cores to run in parallel while binning variables. None means 1 core. -1 means using all processors.
New in version 0.7.1.
verbose (bool (default=False)) – Enable verbose output.
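The two max_pvalue_policy options differ in which pairs of bins are compared. The following is a minimal sketch of that difference only; the helper name bin_pairs is hypothetical and the statistical test itself is not shown.

```python
from itertools import combinations


def bin_pairs(n_bins, policy="consecutive"):
    """Return the (i, j) bin index pairs that a p-value check would compare."""
    if policy == "consecutive":
        # Only neighbouring bins: (0, 1), (1, 2), ...
        return [(i, i + 1) for i in range(n_bins - 1)]
    if policy == "all":
        # Every unordered pair of bins.
        return list(combinations(range(n_bins), 2))
    raise ValueError(f"unsupported policy: {policy!r}")


print(bin_pairs(4, "consecutive"))  # [(0, 1), (1, 2), (2, 3)]
print(len(bin_pairs(4, "all")))     # 6
```

With n bins, "consecutive" yields n - 1 comparisons while "all" yields n(n - 1)/2, so the latter is the stricter constraint.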
Notes
The selection_criteria parameter specifies the criteria for variable selection. The input is a dictionary as follows:
    selection_criteria = {
        "metric_1": {
            "min": 0,
            "max": 1,
            "strategy": "highest",
            "top": 0.25
        },
        "metric_2": {
            "min": 0.02
        }
    }
where several metrics can be combined. For example, the above dictionary indicates that the top 25% of variables with “metric_1” in [0, 1] and “metric_2” greater than or equal to 0.02 are selected. Supported key values are:
keys min and max support numerical values.
key strategy supports options “highest” and “lowest”.
key top supports an integer or decimal (percentage).
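To make the key semantics concrete, here is a rough sketch of one plausible reading of such a dictionary, written in plain Python. It is not the library's implementation: the function select_variables, the sample metric values, the sequential application of metrics, and the rounding of a decimal top with ceil are all assumptions made for illustration.

```python
import math


def select_variables(metrics, selection_criteria):
    """metrics maps variable name -> {metric name: value}; returns kept names."""
    selected = list(metrics)
    for metric, rule in selection_criteria.items():
        lo = rule.get("min", -math.inf)
        hi = rule.get("max", math.inf)
        # Keep variables whose metric value lies in [min, max].
        selected = [v for v in selected if lo <= metrics[v][metric] <= hi]
        if "top" in rule:
            # "highest" ranks in descending order, "lowest" in ascending order.
            reverse = rule.get("strategy", "highest") == "highest"
            ranked = sorted(selected, key=lambda v: metrics[v][metric],
                            reverse=reverse)
            top = rule["top"]
            # Integer: number of variables; decimal: fraction of candidates
            # (rounded up here, which is an assumption).
            n_keep = top if isinstance(top, int) else math.ceil(top * len(ranked))
            selected = ranked[:n_keep]
    return selected


# Hypothetical metric values for four variables.
metrics = {
    "a": {"metric_1": 0.9, "metric_2": 0.10},
    "b": {"metric_1": 0.5, "metric_2": 0.01},
    "c": {"metric_1": 0.7, "metric_2": 0.05},
    "d": {"metric_1": 1.5, "metric_2": 0.30},
}
criteria = {
    "metric_1": {"min": 0, "max": 1, "strategy": "highest", "top": 0.25},
    "metric_2": {"min": 0.02},
}
print(select_variables(metrics, criteria))  # ['a']
```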
Warning
If the binning process instance is going to be saved, do not pass the option "solver": "mip" via the binning_fit_params parameter.
-
fit(X, y, sample_weight=None, check_input=False)
Fit the binning process. Fit the optimal binning to all variables according to the given training data.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) –
Training vector, where n_samples is the number of samples.
Changed in version 0.4.0.
X supports numpy.ndarray and pandas.DataFrame.
y (array-like of shape (n_samples,)) – Target vector relative to X.
sample_weight (array-like of shape (n_samples,) (default=None)) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Only applied if prebinning_method="cart". This option is only available for a binary target.
check_input (bool (default=False)) – Whether to check input arrays.
- Returns
self – Fitted binning process.
- Return type
BinningProcess
-
fit_disk(input_path, target, **kwargs)
Fit the binning process according to the given training data on disk.
- Parameters
input_path (str) – Any valid string path to a file with extension .csv or .parquet.
target (str) – Target column.
**kwargs (keyword arguments) – Keyword arguments for pandas.read_csv or pandas.read_parquet.
- Returns
self – Fitted binning process.
- Return type
BinningProcess
-
fit_from_dict(dict_optb)
Fit the binning process from a dict of OptimalBinning objects already fitted.
- Parameters
dict_optb (dict) – Dictionary with OptimalBinning objects for binary, continuous or multiclass target. All objects must share the same class.
- Returns
self – Fitted binning process.
- Return type
BinningProcess
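Since all objects in dict_optb must share the same class, a caller may want to verify this before calling fit_from_dict. The sketch below is one way to do so; check_same_class and the stand-in classes are hypothetical and only illustrate the check, not the library's own validation.

```python
def check_same_class(dict_optb):
    """Raise TypeError unless every value in dict_optb has the same class."""
    classes = {type(optb) for optb in dict_optb.values()}
    if len(classes) != 1:
        names = sorted(cls.__name__ for cls in classes)
        raise TypeError(f"expected exactly one binning class, got: {names}")
    return next(iter(classes))


# Stand-in classes for illustration only; real inputs would be fitted
# optimal binning objects.
class BinaryStandIn:
    pass


class ContinuousStandIn:
    pass


same = {"x1": BinaryStandIn(), "x2": BinaryStandIn()}
print(check_same_class(same).__name__)  # BinaryStandIn
```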
-
fit_transform(X, y, sample_weight=None, metric=None, metric_special=0, metric_missing=0, show_digits=2, check_input=False)
Fit the binning process according to the given training data, then transform it.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples.
y (array-like of shape (n_samples,)) – Target vector relative to X.
sample_weight (array-like of shape (n_samples,) (default=None)) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Only applied if prebinning_method="cart". This option is only available for a binary target.
metric (str or None, (default=None)) – The metric used to transform the input vector. If None, the default transformation metric for each target type is applied. For binary target options are: “woe” (default), “event_rate”, “indices” and “bins”. For continuous target options are: “mean” (default), “indices” and “bins”. For multiclass target options are: “mean_woe” (default), “weighted_mean_woe”, “indices” and “bins”.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.
- Returns
X_new – Transformed array.
- Return type
numpy array, shape = (n_samples, n_features_new)
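The metric options listed above depend on the target type, with the first option of each group acting as the default when metric is None. A small lookup table captures this; resolve_metric and the target type labels used as keys are hypothetical names for this sketch, not part of optbinning.

```python
# Metric options per target type, default first, as listed in the parameters.
METRICS = {
    "binary": ("woe", "event_rate", "indices", "bins"),
    "continuous": ("mean", "indices", "bins"),
    "multiclass": ("mean_woe", "weighted_mean_woe", "indices", "bins"),
}


def resolve_metric(target_dtype, metric=None):
    """Return the metric to apply, falling back to the target type's default."""
    options = METRICS[target_dtype]
    if metric is None:
        return options[0]
    if metric not in options:
        raise ValueError(
            f"metric {metric!r} is not valid for a {target_dtype} target; "
            f"choose one of {options}"
        )
    return metric


print(resolve_metric("binary"))              # woe
print(resolve_metric("continuous", "bins"))  # bins
```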
-
fit_transform_disk(input_path, output_path, target, chunksize, metric=None, metric_special=0, metric_missing=0, show_digits=2, **kwargs)
Fit the binning process according to the given training data on disk, then transform it and save to comma-separated values (csv) file.
- Parameters
input_path (str) – Any valid string path to a file with extension .csv.
output_path (str) – Any valid string path to a file with extension .csv.
target (str) – Target column.
chunksize – Rows to read, transform and write at a time.
metric (str or None, (default=None)) – The metric used to transform the input vector. If None, the default transformation metric for each target type is applied. For binary target options are: “woe” (default), “event_rate”, “indices” and “bins”. For continuous target options are: “mean” (default), “indices” and “bins”. For multiclass target options are: “mean_woe” (default), “weighted_mean_woe”, “indices” and “bins”.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
**kwargs (keyword arguments) – Keyword arguments for pandas.read_csv.
- Returns
self – Fitted binning process.
- Return type
BinningProcess
-
get_binned_variable(name)
Return optimal binning object for a given variable name.
- Parameters
name (string) – The variable name.
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
-
get_support(indices=False, names=False)
Get a mask, or integer index, or names of the variables selected.
- Parameters
indices (boolean (default=False)) – If True, the return value will be an array of integers, rather than a boolean mask.
names (boolean (default=False)) – If True, the return value will be an array of strings, rather than a boolean mask.
- Returns
support – An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector. If names is True, this is a string array of shape [# output features], whose values are names of the selected features.
- Return type
array
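The three return forms describe the same selection. The sketch below shows how the integer indices and the names relate to a boolean mask, using plain lists; the variable names and mask values are made up for illustration.

```python
from itertools import compress

# Hypothetical input variables and the boolean mask of selected ones.
variable_names = ["age", "income", "balance", "tenure"]
mask = [True, False, True, True]

# Integer form: positions of the True entries in the input feature vector.
indices = [i for i, keep in enumerate(mask) if keep]
# Name form: the variable names at those positions.
names = list(compress(variable_names, mask))

print(indices)  # [0, 2, 3]
print(names)    # ['age', 'balance', 'tenure']
```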
-
information(print_level=1)
Print overview information about the options settings and statistics.
- Parameters
print_level (int (default=1)) – Level of details.
-
classmethod load(path)
Load binning process from pickle file.
- Parameters
path (str) – Pickle file path.
Example
>>> from optbinning import BinningProcess
>>> binning_process = BinningProcess.load("my_binning_process.pkl")
-
save(path)
Save binning process to pickle file.
- Parameters
path (str) – Pickle file path.
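Because the class warning advises against saving an instance configured with "solver": "mip", a check before saving might look like the following sketch. The helper variables_using_mip is hypothetical and not part of optbinning; it only inspects a binning_fit_params dictionary.

```python
def variables_using_mip(binning_fit_params):
    """Return the variables whose fit options request the "mip" solver."""
    if not binning_fit_params:
        return []
    return [
        name
        for name, options in binning_fit_params.items()
        if options.get("solver") == "mip"
    ]


# Hypothetical per-variable fit options.
fit_params = {
    "variable_1": {"max_n_bins": 4},
    "variable_2": {"solver": "mip"},
}
print(variables_using_mip(fit_params))  # ['variable_2']
```

An empty result suggests the configuration does not conflict with the warning.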
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
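The <component>__<parameter> naming can be illustrated by splitting each key at the first double underscore. This is a simplified sketch over plain dictionaries rather than real estimators; split_nested_params and the component name "binning" are hypothetical.

```python
def split_nested_params(params):
    """Group parameters by component; top-level keys are stored under None."""
    grouped = {}
    for key, value in params.items():
        # Split at the first "__": "binning__max_n_prebins" ->
        # ("binning", "__", "max_n_prebins").
        name, delim, sub_key = key.partition("__")
        if delim:
            grouped.setdefault(name, {})[sub_key] = value
        else:
            grouped.setdefault(None, {})[name] = value
    return grouped


print(split_nested_params({"verbose": True, "binning__max_n_prebins": 10}))
# {None: {'verbose': True}, 'binning': {'max_n_prebins': 10}}
```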
-
summary()
Binning process summary with main statistics for all binned variables.
- Returns
df_summary (pandas.DataFrame) – Binning process summary.
-
transform(X, metric=None, metric_special=0, metric_missing=0, show_digits=2, check_input=False)
Transform given data to metric using bins from each fitted optimal binning.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples.
metric (str or None, (default=None)) – The metric used to transform the input vector. If None, the default transformation metric for each target type is applied. For binary target options are: “woe” (default), “event_rate”, “indices” and “bins”. For continuous target options are: “mean” (default), “indices” and “bins”. For multiclass target options are: “mean_woe” (default), “weighted_mean_woe”, “indices” and “bins”.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.
- Returns
X_new (numpy array or pandas.DataFrame, shape = (n_samples, n_features_new)) – Transformed array.
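As a worked illustration of the “woe” and “event_rate” metrics for a binary target, the sketch below maps values of a single variable to per-bin metrics. Everything in it is assumed for illustration: the split points, the counts, the WoE sign convention (log of non-event share over event share), the left-closed bin intervals, and the use of None for a missing value. It does not call optbinning.

```python
import math
from bisect import bisect_right

# Hypothetical fitted result for one variable.
splits = [30.0, 50.0]                    # bins: (-inf, 30), [30, 50), [50, inf)
counts = [(80, 20), (60, 40), (60, 40)]  # (non-events, events) per bin

total_non_event = sum(ne for ne, _ in counts)
total_event = sum(e for _, e in counts)

# Assumed convention: WoE = log(non-event share / event share) per bin.
woe = [math.log((ne / total_non_event) / (e / total_event)) for ne, e in counts]
event_rate = [e / (ne + e) for ne, e in counts]


def transform_value(x, table, metric_missing=0):
    """Map one value to its bin's metric; None stands in for a missing value."""
    if x is None:
        return metric_missing
    # bisect_right gives the bin index for left-closed intervals [a, b).
    return table[bisect_right(splits, x)]


print(transform_value(25, woe))           # log(2), about 0.693
print(transform_value(30.0, event_rate))  # 0.4
print(transform_value(None, woe))         # 0
```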
-
transform_disk(input_path, output_path, chunksize, metric=None, metric_special=0, metric_missing=0, show_digits=2, **kwargs)
Transform given data on disk to metric using bins from each fitted optimal binning. Save to comma-separated values (csv) file.
- Parameters
input_path (str) – Any valid string path to a file with extension .csv.
output_path (str) – Any valid string path to a file with extension .csv.
chunksize – Rows to read, transform and write at a time.
metric (str or None, (default=None)) – The metric used to transform the input vector. If None, the default transformation metric for each target type is applied. For binary target options are: “woe” (default), “event_rate”, “indices” and “bins”. For continuous target options are: “mean” (default), “indices” and “bins”. For multiclass target options are: “mean_woe” (default), “weighted_mean_woe”, “indices” and “bins”.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate for a binary target, and any numerical value for other targets.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
**kwargs (keyword arguments) – Keyword arguments for pandas.read_csv.
- Returns
self – Fitted binning process.
- Return type
BinningProcess
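The role of chunksize, reading, transforming and writing a fixed number of rows at a time, can be sketched with the standard library alone. This is not how the library is implemented (its keyword arguments are passed to pandas.read_csv); the function transform_csv_in_chunks, the sample data and the per-row transform are hypothetical, and in-memory buffers stand in for the .csv files.

```python
import csv
import io
from itertools import islice


def transform_csv_in_chunks(src, dst, chunksize, transform_row):
    """Read rows from src, transform them chunk by chunk, and write to dst."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    n_chunks = 0
    while True:
        # Hold at most chunksize rows in memory at a time.
        chunk = list(islice(reader, chunksize))
        if not chunk:
            break
        writer.writerows(transform_row(row) for row in chunk)
        n_chunks += 1
    return n_chunks


def to_bin_index(row):
    """Hypothetical transform: replace "age" by a bin index for splits 30, 50."""
    return {**row, "age": sum(float(row["age"]) >= s for s in (30.0, 50.0))}


# In-memory stand-ins for the input and output .csv files.
src = io.StringIO("age,income\n25,1000\n40,2500\n61,1800\n")
dst = io.StringIO()

n_chunks = transform_csv_in_chunks(src, dst, chunksize=2, transform_row=to_bin_index)
print(n_chunks)  # 2
print(dst.getvalue())
```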
-
update_binned_variable(name, optb)
Update optimal binning object for a given variable.
- Parameters
name (string) – The variable name.
optb (object) – The optimal binning object already fitted.