Binning tables
Binning table: binary target
- class optbinning.binning.binning_statistics.BinningTable(name, dtype, special_codes, splits, n_nonevent, n_event, min_x=None, max_x=None, categories=None, cat_others=None, user_splits=None)
Bases: object
Binning table to summarize optimal binning of a numerical or categorical variable with respect to a binary target.
- Parameters
name (str, optional (default="")) – The variable name.
dtype (str, optional (default="numerical")) – The variable data type. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.
special_codes (array-like, dict or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
splits (numpy.ndarray) – List of split points.
n_nonevent (numpy.ndarray) – Number of non-events.
n_event (numpy.ndarray) – Number of events.
min_x (float or None (default=None)) – Minimum value of x.
max_x (float or None (default=None)) – Maximum value of x.
categories (list, numpy.ndarray or None, optional (default=None)) – List of categories.
cat_others (list, numpy.ndarray or None, optional (default=None)) – List of categories in others’ bin.
user_splits (numpy.ndarray) – List of user-provided split points, set if pre-bins were passed by the user.
Warning
This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table, available in all optimal binning classes.
- analysis(pvalue_test='chi2', n_samples=100, print_output=True)
Binning table analysis.
Statistical analysis of the binning table, computing the Gini index, Information Value (IV), Jensen-Shannon divergence, and the quality score. Additionally, several statistical significance tests between consecutive bins of the contingency table are performed: a frequentist test using the Chi-square test or Fisher’s exact test, and a Bayesian A/B test using the beta distribution as a conjugate prior of the Bernoulli distribution.
- Parameters
pvalue_test (str, optional (default="chi2")) – The statistical test. Supported tests are “chi2” for the Chi-square test and “fisher” for the Fisher exact test.
n_samples (int, optional (default=100)) – The number of samples to run the Bayesian A/B testing between consecutive bins to compute the probability of the event rate of bin A being greater than the event rate of bin B.
print_output (bool (default=True)) – Whether to print analysis information.
Notes
The Chi-square test uses scipy.stats.chi2_contingency, and the Fisher exact test uses scipy.stats.fisher_exact.
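The Bayesian A/B comparison between two consecutive bins can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the function name `prob_a_greater_b`, the uniform Beta(1, 1) prior, and the fixed seed are assumptions made for the example.

```python
import random


def prob_a_greater_b(event_a, nonevent_a, event_b, nonevent_b,
                     n_samples=100, seed=0):
    """Estimate P(event rate of bin A > event rate of bin B) by sampling.

    Assumes a uniform Beta(1, 1) prior on each event rate, so the posterior
    of a bin with e events and n non-events is Beta(1 + e, 1 + n).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        rate_a = rng.betavariate(1 + event_a, 1 + nonevent_a)
        rate_b = rng.betavariate(1 + event_b, 1 + nonevent_b)
        wins += rate_a > rate_b
    return wins / n_samples


# Bin A has a much higher observed event rate than bin B.
p_example = prob_a_greater_b(90, 10, 10, 90)
# Two bins with identical counts should give a probability near 0.5.
p_equal = prob_a_greater_b(50, 50, 50, 50)
```

A larger `n_samples` reduces the Monte Carlo noise of the estimate at the cost of more sampling.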
- build(show_digits=2, add_totals=True)
Build the binning table.
- Parameters
show_digits (int, optional (default=2)) – The number of significant digits of the bin column.
add_totals (bool (default=True)) – Whether to add a last row with totals.
- Returns
binning_table
- Return type
pandas.DataFrame
- property gini
The Gini coefficient or Accuracy Ratio.
The Gini coefficient is a quantitative measure of the discriminatory and predictive power of a variable. The Gini coefficient ranges from 0 to 1.
- Returns
gini
- Return type
float
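To make the Gini coefficient concrete, the sketch below derives it from per-bin counts through the identity Gini = 2 · AUC − 1, ranking bins by event rate. It is one common way to compute the statistic and is not necessarily how the library does it; the function name `gini_from_bins` and the sample counts are hypothetical, and every bin is assumed to be non-empty.

```python
def gini_from_bins(n_nonevent, n_event):
    """Gini coefficient of a binned variable, via Gini = 2 * AUC - 1."""
    # Rank bins by event rate (assumes every bin has at least one record).
    bins = sorted(zip(n_nonevent, n_event),
                  key=lambda b: b[1] / (b[0] + b[1]))
    total_nonevent = sum(n_nonevent)
    total_event = sum(n_event)

    concordant = 0.0
    nonevent_below = 0
    for nonevent, event in bins:
        # Events outrank all non-events in lower bins; ties count one half.
        concordant += event * (nonevent_below + 0.5 * nonevent)
        nonevent_below += nonevent

    auc = concordant / (total_event * total_nonevent)
    return 2 * auc - 1


gini_example = gini_from_bins([80, 50, 20], [10, 20, 40])
```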
- property hellinger
The Hellinger divergence.
- Returns
hellinger
- Return type
float
- property iv
The Information Value (IV) or Jeffreys divergence measure.
The IV ranges from 0 to infinity.
- Returns
iv
- Return type
float
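As an illustration of the IV as a Jeffreys divergence, the following sketch computes the per-bin Weight of Evidence and the total IV from non-event and event counts. The function name `woe_iv`, the sample counts, and the sign convention (non-event share over event share) are assumptions for the example; all counts are assumed to be strictly positive.

```python
import math


def woe_iv(n_nonevent, n_event):
    """Return the per-bin WoE values and the total IV from bin counts."""
    total_nonevent = sum(n_nonevent)
    total_event = sum(n_event)
    woes = []
    iv = 0.0
    for nonevent, event in zip(n_nonevent, n_event):
        p_nonevent = nonevent / total_nonevent  # share of all non-events
        p_event = event / total_event           # share of all events
        woe = math.log(p_nonevent / p_event)
        woes.append(woe)
        # Each term (p - q) * log(p / q) is non-negative.
        iv += (p_nonevent - p_event) * woe
    return woes, iv


woes, iv = woe_iv([80, 50, 20], [10, 20, 40])
```

Because every term is non-negative, the IV is zero only when events and non-events are distributed identically across bins.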
- property js
The Jensen-Shannon divergence measure (JS).
The JS ranges from 0 to \(\log(2)\).
- Returns
js
- Return type
float
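The \(\log(2)\) upper bound can be checked with a small sketch of the Jensen-Shannon divergence between the event and non-event distributions over bins. The function name `js_divergence` is hypothetical, and natural logarithms are assumed.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the sum.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


js_identical = js_divergence([0.5, 0.5], [0.5, 0.5])  # no separation
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])   # perfect separation
```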
- property ks
The Kolmogorov-Smirnov statistic.
- Returns
ks
- Return type
float
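For binned data, the Kolmogorov-Smirnov statistic is commonly taken as the largest gap between the cumulative non-event and event distributions across ordered bins. The sketch below follows that definition; the function name `ks_statistic` and the sample counts are hypothetical, and the bins are assumed to be given in their natural order.

```python
from itertools import accumulate


def ks_statistic(n_nonevent, n_event):
    """Maximum absolute gap between the two cumulative distributions."""
    total_nonevent = sum(n_nonevent)
    total_event = sum(n_event)
    cum_nonevent = [c / total_nonevent for c in accumulate(n_nonevent)]
    cum_event = [c / total_event for c in accumulate(n_event)]
    return max(abs(a - b) for a, b in zip(cum_nonevent, cum_event))


ks_example = ks_statistic([80, 50, 20], [10, 20, 40])
```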
- plot(metric='woe', add_special=True, add_missing=True, style='bin', show_bin_labels=False, savefig=None, figsize=None)
Plot the binning table.
Visualize the non-event and event count, and the Weight of Evidence or the event rate for each bin.
- Parameters
metric (str, optional (default="woe")) – Supported metrics are “woe” to show the Weight of Evidence (WoE) measure and “event_rate” to show the event rate.
add_special (bool (default=True)) – Whether to add the special codes bin.
add_missing (bool (default=True)) – Whether to add the missing values bin.
style (str, optional (default="bin")) – Plot style. style="bin" shows the standard binning plot. If style="actual", show the plot with the actual scale, i.e., actual bin widths.
show_bin_labels (bool (default=False)) –
Whether to show the bin label instead of the bin id on the x-axis. For long labels (length > 27), labels are truncated.
New in version 0.15.1.
savefig (str or None (default=None)) – Path to save the plot figure.
figsize (tuple or None (default=None)) – Size of the plot.
- property quality_score
The quality score (QS).
The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.
- Returns
quality_score
- Return type
float
- property triangular
The triangular divergence.
- Returns
triangular
- Return type
float
Binning table: continuous target
- class optbinning.binning.binning_statistics.ContinuousBinningTable(name, dtype, special_codes, splits, n_records, sums, stds, min_target, max_target, n_zeros, min_x=None, max_x=None, categories=None, cat_others=None, user_splits=None)
Bases: object
Binning table to summarize optimal binning of a numerical or categorical variable with respect to a continuous target.
- Parameters
name (str, optional (default="")) – The variable name.
dtype (str, optional (default="numerical")) – The variable data type. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.
special_codes (array-like, dict or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
splits (numpy.ndarray) – List of split points.
n_records (numpy.ndarray) – Number of records.
sums (numpy.ndarray) – Target sums.
stds (numpy.ndarray) – Target stds.
min_target (numpy.ndarray) – Target minimum values.
max_target (numpy.ndarray) – Target maximum values.
n_zeros (numpy.ndarray) – Number of zeros.
min_x (float or None (default=None)) – Minimum value of x.
max_x (float or None (default=None)) – Maximum value of x.
categories (list, numpy.ndarray or None, optional (default=None)) – List of categories.
cat_others (list, numpy.ndarray or None, optional (default=None)) – List of categories in others’ bin.
user_splits (numpy.ndarray) – List of user-provided split points, set if pre-bins were passed by the user.
Warning
This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table, available in all optimal binning classes.
- analysis(print_output=True)
Binning table analysis.
Statistical analysis of the binning table, computing the Information Value (IV) and Herfindahl-Hirschman Index (HHI).
- Parameters
print_output (bool (default=True)) – Whether to print analysis information.
Notes
The IV for a continuous target is computed as follows:
\[IV = \sum_{i=1}^n |U_i - \mu| \frac{r_i}{r_T},\]where \(U_i\) is the target mean value for each bin, \(\mu\) is the total target mean, \(r_i\) is the number of records for each bin, and \(r_T\) is the total number of records.
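The formula above translates directly into a few lines of code. In this sketch the function name `continuous_iv` and the sample data are hypothetical; the inputs mirror the `n_records` and `sums` arrays described in the class parameters, and every bin is assumed to hold at least one record.

```python
def continuous_iv(n_records, sums):
    """IV for a continuous target: sum of |U_i - mu| * r_i / r_T."""
    total_records = sum(n_records)      # r_T
    mu = sum(sums) / total_records      # total target mean
    iv = 0.0
    for records, target_sum in zip(n_records, sums):
        bin_mean = target_sum / records  # U_i
        iv += abs(bin_mean - mu) * records / total_records
    return iv


# Three bins with target means 1, 2 and 4; the overall mean is 3.1.
iv_example = continuous_iv([10, 30, 60], [10.0, 60.0, 240.0])
```

With these counts the three terms are 0.21, 0.33 and 0.54, so the IV is 1.08.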
- build(show_digits=2, add_totals=True)
Build the binning table.
- Parameters
show_digits (int, optional (default=2)) – The number of significant digits of the bin column.
add_totals (bool (default=True)) – Whether to add a last row with totals.
- Returns
binning_table
- Return type
pandas.DataFrame
- property iv
The Information Value (IV).
The IV ranges from 0 to infinity.
- Returns
iv
- Return type
float
- plot(add_special=True, add_missing=True, style='bin', show_bin_labels=False, savefig=None, figsize=None, metric='mean')
Plot the binning table.
Visualize records count and mean values.
- Parameters
metric (str, optional (default="mean")) –
Supported metrics are “mean” to show the mean value of the target variable in each bin, “iv” to show the IV of each bin, and “woe” to show the Weight of Evidence (WoE) of each bin.
New in version 0.19.0.
add_special (bool (default=True)) – Whether to add the special codes bin.
add_missing (bool (default=True)) – Whether to add the missing values bin.
style (str, optional (default="bin")) – Plot style. style="bin" shows the standard binning plot. If style="actual", show the plot with the actual scale, i.e., actual bin widths.
show_bin_labels (bool (default=False)) –
Whether to show the bin label instead of the bin id on the x-axis. For long labels (length > 27), labels are truncated.
New in version 0.15.1.
savefig (str or None (default=None)) – Path to save the plot figure.
figsize (tuple or None (default=None)) – Size of the plot.
- property quality_score
The quality score (QS).
The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.
- Returns
quality_score
- Return type
float
- property woe
The sum of absolute WoEs.
This metric is computed as follows:
\[WoE = \sum_{i=1}^n |U_i - \mu|,\] where \(U_i\) is the target mean value for each bin and \(\mu\) is the total target mean.
- Returns
woe
- Return type
float
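Unlike the IV for a continuous target, the sum of absolute WoEs does not weight each bin by its share of records. A minimal sketch, with the hypothetical function name `sum_abs_woe` and made-up data, assuming non-empty bins:

```python
def sum_abs_woe(n_records, sums):
    """Unweighted sum of |U_i - mu| over all bins."""
    mu = sum(sums) / sum(n_records)  # total target mean
    return sum(abs(target_sum / records - mu)
               for records, target_sum in zip(n_records, sums))


# Bin means are 1, 2 and 4; the overall mean is 3.1.
woe_example = sum_abs_woe([10, 30, 60], [10.0, 60.0, 240.0])
```

Here the per-bin deviations are 2.1, 1.1 and 0.9, giving a total of 4.1 regardless of how many records each bin holds.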
Binning table: multiclass target
- class optbinning.binning.binning_statistics.MulticlassBinningTable(name, special_codes, splits, n_event, classes)
Bases: object
Binning table to summarize optimal binning of a numerical variable with respect to a multiclass or multilabel target.
- Parameters
name (str, optional (default="")) – The variable name.
splits (numpy.ndarray) – List of split points.
n_event (numpy.ndarray) – Number of events.
classes (array-like) – List of classes.
Warning
This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table, available in all optimal binning classes.
- analysis(print_output=True)
Binning table analysis.
Statistical analysis of the binning table, computing the Jensen-Shannon divergence and the quality score. Additionally, a statistical significance test between consecutive bins of the contingency table is performed using the Chi-square test.
- Parameters
print_output (bool (default=True)) – Whether to print analysis information.
Notes
The Chi-square test uses scipy.stats.chi2_contingency.
- build(show_digits=2, add_totals=True)
Build the binning table.
- Parameters
show_digits (int, optional (default=2)) – The number of significant digits of the bin column.
add_totals (bool (default=True)) – Whether to add a last row with totals.
- Returns
binning_table
- Return type
pandas.DataFrame
- property js
The Jensen-Shannon divergence measure (JS).
The JS ranges from 0 to \(\log(n_{classes})\).
- Returns
js
- Return type
float
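The \(\log(n_{classes})\) bound follows from the generalized Jensen-Shannon divergence, defined as the entropy of the mixture minus the mean entropy of the individual distributions. The sketch below assumes equal weights for all classes and natural logarithms; the function names `entropy` and `generalized_js` are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def generalized_js(distributions):
    """Equal-weight Jensen-Shannon divergence of several distributions.

    Each entry is one class's distribution of events over the bins.
    """
    n = len(distributions)
    mixture = [sum(column) / n for column in zip(*distributions)]
    return entropy(mixture) - sum(entropy(d) for d in distributions) / n


# Three classes concentrated in three different bins reach the upper bound.
js_max = generalized_js([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
# Identical class distributions give zero divergence.
js_min = generalized_js([[0.2, 0.8], [0.2, 0.8], [0.2, 0.8]])
```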
- plot(add_special=True, add_missing=True, show_bin_labels=False, savefig=None, figsize=None)
Plot the binning table.
Visualize event count and event rate values for each class.
- Parameters
add_special (bool (default=True)) – Whether to add the special codes bin.
add_missing (bool (default=True)) – Whether to add the missing values bin.
show_bin_labels (bool (default=False)) –
Whether to show the bin label instead of the bin id on the x-axis. For long labels (length > 27), labels are truncated.
New in version 0.15.1.
savefig (str or None (default=None)) – Path to save the plot figure.
figsize (tuple or None (default=None)) – Size of the plot.
- property quality_score
The quality score (QS).
The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.
- Returns
quality_score
- Return type
float