Binning process sketch with binary target
-
class optbinning.BinningProcessSketch(variable_names, max_n_prebins=20, min_n_bins=None, max_n_bins=None, min_bin_size=None, max_bin_size=None, max_pvalue=None, max_pvalue_policy='consecutive', selection_criteria=None, categorical_variables=None, special_codes=None, split_digits=None, binning_fit_params=None, binning_transform_params=None, verbose=False)
Bases: optbinning.binning.distributed.base.BaseSketch, sklearn.base.BaseEstimator, optbinning.binning.binning_process.BaseBinningProcess
Binning process over data streams to compute optimal binning of variables with respect to a binary target.
- Parameters
variable_names (array-like) – List of variable names.
max_n_prebins (int (default=20)) – The maximum number of bins after pre-binning (prebins).
min_n_bins (int or None, optional (default=None)) – The minimum number of bins. If None, then min_n_bins is a value in [0, max_n_prebins].
max_n_bins (int or None, optional (default=None)) – The maximum number of bins. If None, then max_n_bins is a value in [0, max_n_prebins].
min_bin_size (float or None, optional (default=None)) – The minimum fraction of records in each bin. If None, min_bin_size = min_prebin_size.
max_bin_size (float or None, optional (default=None)) – The maximum fraction of records in each bin. If None, max_bin_size = 1.0.
max_pvalue (float or None, optional (default=None)) – The maximum p-value among bins.
max_pvalue_policy (str, optional (default="consecutive")) – The method to determine bins not satisfying the p-value constraint. Supported methods are “consecutive” to compare consecutive bins and “all” to compare all bins.
selection_criteria (dict or None (default=None)) – Variable selection criteria. See notes.
special_codes (array-like or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
split_digits (int or None, optional (default=None)) – The significant digits of the split points. If split_digits is set to 0, the split points are integers. If None, then all significant digits in the split points are considered.
categorical_variables (array-like or None, optional (default=None)) – List of numerical variables to be considered categorical. These are nominal variables. Not applicable when target type is multiclass.
binning_fit_params (dict or None, optional (default=None)) – Dictionary with optimal binning fitting options for specific variables. Example: {"variable_1": {"max_n_bins": 4}}.
binning_transform_params (dict or None, optional (default=None)) – Dictionary with optimal binning transform options for specific variables. Example: {"variable_1": {"metric": "event_rate"}}.
verbose (bool (default=False)) – Enable verbose output.
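As a rough illustration of how the per-variable dictionaries relate to the process-wide options, the sketch below merges a set of defaults with the options given for one variable. The helper resolve_fit_params and the process_options dictionary are hypothetical and not part of optbinning. The sketch assumes that variable-specific options take precedence over process-wide ones.

```python
# Hypothetical process-wide options (a subset of the constructor arguments).
process_options = {"max_n_prebins": 20, "max_n_bins": None}

# Per-variable fitting options, in the format expected by binning_fit_params.
binning_fit_params = {"variable_1": {"max_n_bins": 4}}


def resolve_fit_params(name):
    """Return the options for one variable (assumption: overrides win)."""
    return {**process_options, **binning_fit_params.get(name, {})}


print(resolve_fit_params("variable_1"))  # max_n_bins comes from the override
print(resolve_fit_params("variable_2"))  # no override, process-wide options only
```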
Notes
The selection_criteria parameter specifies the criteria for variable selection. The input is a dictionary as follows:

    selection_criteria = {
        "metric_1": {
            "min": 0,
            "max": 1,
            "strategy": "highest",
            "top": 0.25
        },
        "metric_2": {
            "min": 0.02
        }
    }

where several metrics can be combined. For example, the dictionary above indicates that the top 25% of variables with “metric_1” in [0, 1] and “metric_2” greater than or equal to 0.02 are selected. Supported key values are:
keys min and max support numerical values.
key strategy supports options “highest” and “lowest”.
key top supports an integer or decimal (percentage).
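One possible reading of these rules can be sketched in plain Python. The function select_variables and the metric values below are hypothetical and do not reproduce the library's implementation. The sketch assumes that all min/max filters are applied first, and that a decimal top is rounded up to a whole number of variables.

```python
import math


def select_variables(metrics, selection_criteria):
    """Apply range filters, then keep the top-ranked variables.

    metrics maps each variable name to a dict {metric_name: value}.
    """
    selected = list(metrics)

    # Step 1 (assumed order): keep variables inside every [min, max] range.
    for metric, rule in selection_criteria.items():
        low = rule.get("min", -math.inf)
        high = rule.get("max", math.inf)
        selected = [v for v in selected if low <= metrics[v][metric] <= high]

    # Step 2: apply "strategy"/"top" ranking on the surviving variables.
    for metric, rule in selection_criteria.items():
        if "strategy" in rule and "top" in rule:
            top = rule["top"]
            # An integer is a count; a decimal is a fraction (rounded up).
            n_top = top if isinstance(top, int) else math.ceil(top * len(selected))
            ranked = sorted(
                selected,
                key=lambda v: metrics[v][metric],
                reverse=(rule["strategy"] == "highest"),
            )
            keep = set(ranked[:n_top])
            selected = [v for v in selected if v in keep]

    return selected


selection_criteria = {
    "metric_1": {"min": 0, "max": 1, "strategy": "highest", "top": 0.25},
    "metric_2": {"min": 0.02},
}

# Hypothetical metric values for six variables.
metrics = {
    "a": {"metric_1": 0.9, "metric_2": 0.10},
    "b": {"metric_1": 0.5, "metric_2": 0.01},  # fails metric_2 >= 0.02
    "c": {"metric_1": 0.4, "metric_2": 0.05},
    "d": {"metric_1": 1.5, "metric_2": 0.30},  # fails metric_1 <= 1
    "e": {"metric_1": 0.2, "metric_2": 0.02},
    "f": {"metric_1": 0.7, "metric_2": 0.04},
}

selected = select_variables(metrics, selection_criteria)
print(selected)
```

Four variables pass both range filters here, so a top value of 0.25 keeps a single variable, the one with the highest “metric_1”.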
Warning
If the binning process instance is going to be saved, do not pass the option "solver": "mip" via the binning_fit_params parameter.
-
add(X, y, check_input=False)
Add new data X, y to the binning sketch of each variable.
- Parameters
X (pandas.DataFrame, shape (n_samples, n_features)) –
y (array-like of shape (n_samples,)) – Target vector relative to X.
check_input (bool (default=False)) – Whether to check input arrays.
- Returns
self – Binning process with new data.
- Return type
BinningProcessSketch
-
get_binned_variable(name)
Return optimal binning sketch object for a given variable name.
- Parameters
name (string) – The variable name.
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
-
get_support(indices=False, names=False)
Get a mask, or integer index, or names of the variables selected.
- Parameters
indices (boolean (default=False)) – If True, the return value will be an array of integers, rather than a boolean mask.
names (boolean (default=False)) – If True, the return value will be an array of strings, rather than a boolean mask.
- Returns
support – An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector. If names is True, this is a string array of shape [# output features], whose values are the names of the selected features.
- Return type
array
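The three return forms describe the same selection. The sketch below uses plain Python lists and a hypothetical selection outcome to show how they relate. It does not call the library, and the variable names are made up.

```python
# Hypothetical variables and selection outcome (not produced by optbinning).
variable_names = ["age", "income", "tenure"]

# Boolean mask: one entry per input variable (indices=False, names=False).
mask = [True, False, True]

# Integer indices of the selected variables (indices=True).
indices = [i for i, keep in enumerate(mask) if keep]

# Names of the selected variables (names=True).
names = [name for name, keep in zip(variable_names, mask) if keep]

print(mask, indices, names)
```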
-
information(print_level=1)
Print overview information about the options settings and statistics.
- Parameters
print_level (int (default=1)) – Level of details.
-
classmethod load(path)
Load binning process from pickle file.
- Parameters
path (str) – Pickle file path.
Example
>>> from optbinning import BinningProcessSketch
>>> binning_process = BinningProcessSketch.load("my_binning_process.pkl")
-
merge(bpsketch)
Merge current instance with another BinningProcessSketch instance.
- Parameters
bpsketch (object) – BinningProcessSketch instance.
-
mergeable(bpsketch)
Check whether two BinningProcessSketch instances can be merged.
- Parameters
bpsketch (object) – BinningProcessSketch instance.
- Returns
mergeable
- Return type
bool
-
save(path)
Save binning process to pickle file.
- Parameters
path (str) – Pickle file path.
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
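The <component>__<parameter> naming convention can be illustrated without any library code. The helper split_nested_param below is hypothetical. It only shows how such a name separates into a component and a parameter at the first double underscore.

```python
def split_nested_param(name):
    """Split '<component>__<parameter>' at the first double underscore.

    Returns (None, name) when the name refers to a top-level parameter.
    """
    component, separator, parameter = name.partition("__")
    if separator:
        return component, parameter
    return None, name


print(split_nested_param("binning__max_n_bins"))  # nested parameter
print(split_nested_param("verbose"))              # top-level parameter
```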
-
solve()
Solve optimal binning for all variables using added data.
- Returns
self – Current fitted binning process.
- Return type
BinningProcessSketch
-
summary()
Binning process summary with main statistics for all binned variables.
- Returns
df_summary – Binning process summary.
- Return type
pandas.DataFrame
-
transform(X, metric='woe', metric_special=0, metric_missing=0, show_digits=2, check_input=False)
Transform given data to metric using bins from each fitted optimal binning.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples.
metric (str (default="woe")) – The metric used to transform the input vector. Supported metrics are “woe” to choose the Weight of Evidence, “event_rate” to choose the event rate, “indices” to assign the corresponding indices of the bins and “bins” to assign the corresponding bin interval.
metric_special (float or str (default=0)) – The metric value to transform special codes in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
metric_missing (float or str (default=0)) – The metric value to transform missing values in the input vector. Supported metrics are “empirical” to use the empirical WoE or event rate and any numerical value.
show_digits (int, optional (default=2)) – The number of significant digits of the bin column. Applies when metric="bins".
check_input (bool (default=False)) – Whether to check input arrays.
- Returns
X_new – Transformed array.
- Return type
pandas.DataFrame, shape = (n_samples, n_features_new)
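To make the “woe” and “event_rate” metrics concrete, the sketch below computes both from hypothetical per-bin counts. It assumes the convention WoE = ln(share of non-events / share of events). The counts are made up and the code does not call the library.

```python
import math

# Hypothetical per-bin counts: (non-events, events).
bins = [(400, 20), (300, 30), (200, 50)]

total_nonevent = sum(n_ne for n_ne, _ in bins)
total_event = sum(n_e for _, n_e in bins)

event_rates = []
woes = []
for n_nonevent, n_event in bins:
    # Event rate: share of events among the records in the bin.
    event_rates.append(n_event / (n_nonevent + n_event))
    # WoE (assumed convention): log ratio of the bin's share of
    # non-events to its share of events.
    woes.append(
        math.log((n_nonevent / total_nonevent) / (n_event / total_event))
    )

for rate, woe in zip(event_rates, woes):
    print(f"event_rate={rate:.4f}  woe={woe:.4f}")
```

Under this convention, bins with a lower event rate than the overall population get a positive WoE, and bins with a higher event rate get a negative one.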