Binning tables¶

Binning table: binary target¶

class optbinning.binning.binning_statistics.BinningTable(name, dtype, special_codes, splits, n_nonevent, n_event, min_x=None, max_x=None, categories=None, cat_others=None, user_splits=None)¶

Bases: object

Binning table to summarize optimal binning of a numerical or categorical variable with respect to a binary target.

Parameters

name (str, optional (default="")) – The variable name.
dtype (str, optional (default="numerical")) – The variable data type. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.
special_codes (array-like, dict or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
splits (numpy.ndarray) – List of split points.
n_nonevent (numpy.ndarray) – Number of non-events.
n_event (numpy.ndarray) – Number of events.
min_x (float or None (default=None)) – Minimum value of x.
max_x (float or None (default=None)) – Maximum value of x.
categories (list, numpy.ndarray or None, optional (default=None)) – List of categories.
cat_others (list, numpy.ndarray or None, optional (default=None)) – List of categories in the “others” bin.
user_splits (numpy.ndarray) – List of split points passed by the user as prebins, if any.

Warning

This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table available in all optimal binning classes.

analysis(pvalue_test='chi2', n_samples=100, print_output=True)¶

Binning table analysis.

Statistical analysis of the binning table, computing the Gini index, Information Value (IV), Jensen-Shannon divergence, and the quality score. Additionally, several statistical significance tests between consecutive bins of the contingency table are performed: a frequentist test using the Chi-square test or Fisher’s exact test, and a Bayesian A/B test using the beta distribution as a conjugate prior of the Bernoulli distribution.

Parameters

pvalue_test (str, optional (default="chi2")) – The statistical test. Supported tests are “chi2” for the Chi-square test and “fisher” for Fisher’s exact test.
n_samples (int, optional (default=100)) – The number of samples to run the Bayesian A/B testing between consecutive bins to compute the probability of the event rate of bin A being greater than the event rate of bin B.
print_output (bool (default=True)) – Whether to print analysis information.
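
The Bayesian A/B comparison controlled by n_samples can be illustrated with the standard library alone. This is a minimal sketch, not the library's implementation: it assumes a uniform Beta(1, 1) prior, and the helper name prob_a_greater_b is hypothetical.

```python
import random


def prob_a_greater_b(event_a, nonevent_a, event_b, nonevent_b,
                     n_samples=100, seed=0):
    """Monte Carlo estimate of P(event rate of A > event rate of B).

    Assumes a uniform Beta(1, 1) prior (an assumption of this sketch), so
    each posterior is Beta(1 + events, 1 + non-events).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        rate_a = rng.betavariate(1 + event_a, 1 + nonevent_a)
        rate_b = rng.betavariate(1 + event_b, 1 + nonevent_b)
        wins += rate_a > rate_b
    return wins / n_samples


# Bin A has a much higher event rate than bin B.
p_clear = prob_a_greater_b(80, 20, 20, 80)
# Two bins with identical counts: the probability should be close to 0.5.
p_same = prob_a_greater_b(50, 50, 50, 50, n_samples=2000)
```

A larger n_samples reduces the Monte Carlo noise of the estimate at the cost of more draws.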

Notes

The Chi-square test uses scipy.stats.chi2_contingency, and the Fisher exact test uses scipy.stats.fisher_exact.
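
For two consecutive bins and a binary target the contingency table is 2x2, so the Chi-square test can be written out by hand. The sketch below is illustrative only: the function name chi2_two_bins is hypothetical, no continuity correction is applied (whether the library applies one is not stated here), and all row and column totals are assumed to be positive.

```python
import math


def chi2_two_bins(nonevent_a, event_a, nonevent_b, event_b):
    """Pearson chi-square test on the 2x2 table of two consecutive bins.

    No continuity correction is applied (an assumption of this sketch).
    Returns (statistic, p-value) for one degree of freedom.
    """
    n = nonevent_a + event_a + nonevent_b + event_b
    det = nonevent_a * event_b - event_a * nonevent_b
    denom = ((nonevent_a + event_a) * (nonevent_b + event_b)
             * (nonevent_a + nonevent_b) * (event_a + event_b))
    stat = n * det ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Two bins with very different event rates, then two identical bins.
stat_diff, p_diff = chi2_two_bins(40, 10, 10, 40)
stat_same, p_same = chi2_two_bins(40, 10, 40, 10)
```

A small p-value suggests that the two bins have different event rates, which supports keeping them as separate bins.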

build(show_digits=2, add_totals=True)¶

Build the binning table.

Parameters

show_digits (int, optional (default=2)) – The number of significant digits of the bin column.
add_totals (bool (default=True)) – Whether to add a last row with totals.

Returns

binning_table

Return type

pandas.DataFrame

property gini¶

The Gini coefficient or Accuracy Ratio.

The Gini coefficient is a quantitative measure of the discriminatory and predictive power of a variable. The Gini coefficient ranges from 0 to 1.

Returns: gini
Return type: float
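
As an illustration, the Gini coefficient of a binned variable can be obtained as 2 * AUC - 1, treating each bin as a score group. This is a generic sketch under that assumption and may differ in detail from the library's computation; gini_from_bins is a hypothetical helper and every bin is assumed to be non-empty.

```python
def gini_from_bins(n_event, n_nonevent):
    """Gini = 2 * AUC - 1, using the bins as score groups."""
    total_event = sum(n_event)
    total_nonevent = sum(n_nonevent)
    # Rank bins from highest to lowest event rate.
    bins = sorted(zip(n_event, n_nonevent),
                  key=lambda b: b[0] / (b[0] + b[1]), reverse=True)
    auc = 0.0
    tpr = fpr = 0.0
    for event, nonevent in bins:
        new_tpr = tpr + event / total_event
        new_fpr = fpr + nonevent / total_nonevent
        # Trapezoid under the ROC segment contributed by this bin.
        auc += (new_fpr - fpr) * (tpr + new_tpr) / 2.0
        tpr, fpr = new_tpr, new_fpr
    return 2.0 * auc - 1.0


gini_value = gini_from_bins(n_event=[10, 40], n_nonevent=[40, 10])
```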

property hellinger¶

The Hellinger divergence.

Returns: hellinger
Return type: float
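
A minimal sketch of one common definition, the squared Hellinger distance between the per-bin event and non-event distributions. The normalization factor 0.5 and the helper name hellinger_divergence are assumptions of this example, not confirmed details of the library.

```python
import math


def hellinger_divergence(n_event, n_nonevent):
    """Squared Hellinger distance between event and non-event shares."""
    total_event = sum(n_event)
    total_nonevent = sum(n_nonevent)
    p = [e / total_event for e in n_event]
    q = [ne / total_nonevent for ne in n_nonevent]
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                     for pi, qi in zip(p, q))


hellinger_value = hellinger_divergence([10, 40], [40, 10])
```

With this normalization the value lies between 0 (identical distributions) and 1 (disjoint distributions).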

property iv¶

The Information Value (IV), also known as the Jeffreys divergence measure.

The IV ranges from 0 to Infinity.

Returns: iv
Return type: float
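
The Jeffreys divergence between the per-bin non-event and event distributions can be computed directly from the bin counts. The sketch below is illustrative: woe_and_iv is a hypothetical helper, the WoE sign convention (non-event share over event share) is an assumption, and every bin is assumed to contain at least one event and one non-event. The IV itself does not depend on the sign convention because the divergence is symmetric.

```python
import math


def woe_and_iv(n_event, n_nonevent):
    """Return the per-bin WoE values and the total IV."""
    total_event = sum(n_event)
    total_nonevent = sum(n_nonevent)
    woe, iv = [], 0.0
    for event, nonevent in zip(n_event, n_nonevent):
        p_event = event / total_event
        p_nonevent = nonevent / total_nonevent
        # Assumed convention: log of non-event share over event share.
        w = math.log(p_nonevent / p_event)
        woe.append(w)
        iv += (p_nonevent - p_event) * w
    return woe, iv


woe_values, iv_value = woe_and_iv(n_event=[10, 40], n_nonevent=[40, 10])
```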

property js¶

The Jensen-Shannon divergence measure (JS).

The JS ranges from 0 to \(\log(2)\).

Returns: js
Return type: float
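
The stated range can be checked with a short sketch of the Jensen-Shannon divergence between two discrete distributions, using the natural logarithm. The function name js_divergence is hypothetical; the inputs are assumed to be probability vectors of equal length.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i = 0 contribute nothing to the sum.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Disjoint distributions reach the upper bound, identical ones the lower.
js_max = js_divergence([1.0, 0.0], [0.0, 1.0])
js_min = js_divergence([0.5, 0.5], [0.5, 0.5])
```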

property ks¶

The Kolmogorov-Smirnov statistic.

Returns: ks
Return type: float
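
On binned data, the Kolmogorov-Smirnov statistic is the largest absolute gap between the cumulative event and non-event shares, taken over the bins in table order. A minimal sketch under that assumption, with the hypothetical helper ks_from_bins:

```python
def ks_from_bins(n_event, n_nonevent):
    """Maximum gap between cumulative event and non-event shares."""
    total_event = sum(n_event)
    total_nonevent = sum(n_nonevent)
    cum_event = cum_nonevent = 0.0
    ks = 0.0
    for event, nonevent in zip(n_event, n_nonevent):
        cum_event += event / total_event
        cum_nonevent += nonevent / total_nonevent
        ks = max(ks, abs(cum_event - cum_nonevent))
    return ks


ks_value = ks_from_bins(n_event=[10, 40], n_nonevent=[40, 10])
```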

plot(metric='woe', add_special=True, add_missing=True, style='bin', show_bin_labels=False, savefig=None, figsize=None)¶

Plot the binning table.

Visualize the non-event and event count, and the Weight of Evidence or the event rate for each bin.

Parameters

metric (str, optional (default="woe")) – Supported metrics are “woe” to show the Weight of Evidence (WoE) measure and “event_rate” to show the event rate.
add_special (bool (default=True)) – Whether to add the special codes bin.
add_missing (bool (default=True)) – Whether to add the missing values bin.
style (str, optional (default="bin")) – Plot style. style="bin" shows the standard binning plot. style="actual" shows the plot with the actual scale, i.e., actual bin widths.
show_bin_labels (bool (default=False)) –
Whether to show the bin label instead of the bin id on the x-axis. For long labels (length > 27), labels are truncated.

New in version 0.15.1.
savefig (str or None (default=None)) – Path to save the plot figure.
figsize (tuple or None (default=None)) – Size of the plot.

property quality_score¶

The quality score (QS).

The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.

Returns: quality_score
Return type: float

property triangular¶

The triangular divergence.

Returns: triangular
Return type: float
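
A minimal sketch of the triangular discrimination between the per-bin event and non-event distributions, using its usual definition. The helper name triangular_divergence is hypothetical, and the library's exact normalization is not confirmed here.

```python
def triangular_divergence(n_event, n_nonevent):
    """Sum over bins of (p - q)^2 / (p + q) for event/non-event shares."""
    total_event = sum(n_event)
    total_nonevent = sum(n_nonevent)
    total = 0.0
    for event, nonevent in zip(n_event, n_nonevent):
        p = event / total_event
        q = nonevent / total_nonevent
        if p + q > 0:  # empty bins contribute nothing
            total += (p - q) ** 2 / (p + q)
    return total


triangular_value = triangular_divergence([10, 40], [40, 10])
```

Under this definition the value lies between 0 (identical distributions) and 2 (disjoint distributions).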

Binning table: continuous target¶

class optbinning.binning.binning_statistics.ContinuousBinningTable(name, dtype, special_codes, splits, n_records, sums, stds, min_target, max_target, n_zeros, min_x=None, max_x=None, categories=None, cat_others=None, user_splits=None)¶

Bases: object

Binning table to summarize optimal binning of a numerical or categorical variable with respect to a continuous target.

Parameters

name (str, optional (default="")) – The variable name.
dtype (str, optional (default="numerical")) – The variable data type. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.
special_codes (array-like, dict or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.
splits (numpy.ndarray) – List of split points.
n_records (numpy.ndarray) – Number of records.
sums (numpy.ndarray) – Target sums.
stds (numpy.ndarray) – Target stds.
min_target (numpy.ndarray) – Target minimum values.
max_target (numpy.ndarray) – Target maximum values.
n_zeros (numpy.ndarray) – Number of zeros.
min_x (float or None (default=None)) – Minimum value of x.
max_x (float or None (default=None)) – Maximum value of x.
categories (list, numpy.ndarray or None, optional (default=None)) – List of categories.
cat_others (list, numpy.ndarray or None, optional (default=None)) – List of categories in the “others” bin.
user_splits (numpy.ndarray) – List of split points passed by the user as prebins, if any.

Warning

This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table available in all optimal binning classes.

analysis(print_output=True)¶

Binning table analysis.

Statistical analysis of the binning table, computing the Information Value (IV) and Herfindahl-Hirschman Index (HHI).

Parameters: print_output (bool (default=True)) – Whether to print analysis information.

Notes

The IV for a continuous target is computed as follows:

\[IV = \sum_{i=1}^n |U_i - \mu| \frac{r_i}{r_T},\]

where \(U_i\) is the target mean value for each bin, \(\mu\) is the total target mean, \(r_i\) is the number of records for each bin, and \(r_T\) is the total number of records.
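
The formula above can be evaluated from the per-bin record counts and target sums. The sketch below is a direct transcription of that formula; the function name continuous_iv is hypothetical and every bin is assumed to contain at least one record.

```python
def continuous_iv(n_records, sums):
    """IV = sum_i |U_i - mu| * r_i / r_T for a continuous target."""
    r_total = sum(n_records)
    mu = sum(sums) / r_total  # total target mean
    iv = 0.0
    for r_i, s_i in zip(n_records, sums):
        u_i = s_i / r_i  # target mean of the bin
        iv += abs(u_i - mu) * r_i / r_total
    return iv


# Two bins with means 2.0 and 1.0; the total mean is 1.25.
iv_value = continuous_iv(n_records=[10, 30], sums=[20.0, 30.0])
```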

build(show_digits=2, add_totals=True)¶

Build the binning table.

Parameters

show_digits (int, optional (default=2)) – The number of significant digits of the bin column.
add_totals (bool (default=True)) – Whether to add a last row with totals.

Returns

binning_table

Return type

pandas.DataFrame

property iv¶

The Information Value (IV).

The IV ranges from 0 to Infinity.

Returns: iv
Return type: float

plot(add_special=True, add_missing=True, style='bin', show_bin_labels=False, savefig=None, figsize=None, metric='mean')¶

Plot the binning table.

Visualize records count and mean values.

Parameters

metric (str, optional (default="mean")) –
Supported metrics are “mean” to show the mean value of the target variable in each bin, “iv” to show the IV of each bin, and “woe” to show the Weight of Evidence (WoE) of each bin.

New in version 0.19.0.
add_special (bool (default=True)) – Whether to add the special codes bin.
add_missing (bool (default=True)) – Whether to add the missing values bin.
style (str, optional (default="bin")) – Plot style. style="bin" shows the standard binning plot. style="actual" shows the plot with the actual scale, i.e., actual bin widths.
show_bin_labels (bool (default=False)) –
Whether to show the bin label instead of the bin id on the x-axis. For long labels (length > 27), labels are truncated.

New in version 0.15.1.
savefig (str or None (default=None)) – Path to save the plot figure.
figsize (tuple or None (default=None)) – Size of the plot.

property quality_score¶

The quality score (QS).

The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.

Returns: quality_score
Return type: float

property woe¶

The sum of absolute WoEs.

This metric is computed as follows:

\[WoE = \sum_{i=1}^n |U_i - \mu|,\]

where \(U_i\) is the target mean value for each bin, \(\mu\) is the total target mean.

Returns: woe
Return type: float
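
The metric can be reproduced from the per-bin record counts and target sums. This is a direct transcription of the formula above; the function name sum_abs_woe is hypothetical and every bin is assumed to contain at least one record.

```python
def sum_abs_woe(n_records, sums):
    """WoE = sum_i |U_i - mu| for a continuous target."""
    mu = sum(sums) / sum(n_records)  # total target mean
    return sum(abs(s_i / r_i - mu) for r_i, s_i in zip(n_records, sums))


# Bin means are 2.0 and 1.0; the total mean is 1.25.
woe_value = sum_abs_woe(n_records=[10, 30], sums=[20.0, 30.0])
```

Unlike the IV for a continuous target, this sum does not weight each bin by its share of records.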

Binning table: multiclass target¶

class optbinning.binning.binning_statistics.MulticlassBinningTable(name, special_codes, splits, n_event, classes)¶

Bases: object

Binning table to summarize optimal binning of a numerical variable with respect to a multiclass or multilabel target.

Parameters

name (str, optional (default="")) – The variable name.
splits (numpy.ndarray) – List of split points.
n_event (numpy.ndarray) – Number of events.
classes (array-like) – List of classes.

Warning

This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table available in all optimal binning classes.

analysis(print_output=True)¶

Binning table analysis.

Statistical analysis of the binning table, computing the Jensen-Shannon divergence and the quality score. Additionally, a statistical significance test between consecutive bins of the contingency table is performed using the Chi-square test.

Parameters: print_output (bool (default=True)) – Whether to print analysis information.

Notes

The Chi-square test uses scipy.stats.chi2_contingency.
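
For a multiclass target, the contingency table of two consecutive bins has one column per class. The sketch below computes only the Pearson statistic and the degrees of freedom by hand, to show what the test measures; the function name chi2_statistic is hypothetical and all row and column totals are assumed to be positive.

```python
def chi2_statistic(table):
    """Pearson chi-square statistic and degrees of freedom of a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of bin and class.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Rows are two consecutive bins, columns are the event counts per class.
stat, dof = chi2_statistic([[10, 20, 30], [30, 20, 10]])
```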

build(show_digits=2, add_totals=True)¶

Build the binning table.

Parameters

show_digits (int, optional (default=2)) – The number of significant digits of the bin column.
add_totals (bool (default=True)) – Whether to add a last row with totals.

Returns

binning_table

Return type

pandas.DataFrame

property js¶

The Jensen-Shannon divergence measure (JS).

The JS ranges from 0 to \(\log(n_{classes})\).

Returns: js
Return type: float
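
The upper bound follows from the generalized Jensen-Shannon divergence of several distributions: the entropy of their mixture minus the mean of their entropies. A minimal sketch, assuming equal weights for all classes (the library's weighting is not stated here) and a hypothetical function name generalized_js:

```python
import math


def entropy(dist):
    """Shannon entropy with the natural logarithm."""
    return -sum(x * math.log(x) for x in dist if x > 0)


def generalized_js(distributions):
    """Entropy of the equal-weight mixture minus the mean entropy."""
    k = len(distributions)
    mixture = [sum(col) / k for col in zip(*distributions)]
    return entropy(mixture) - sum(entropy(d) for d in distributions) / k


# Three classes concentrated in three different bins reach log(3).
js_max = generalized_js([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
js_min = generalized_js([[0.2, 0.3, 0.5]] * 3)
```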

plot(add_special=True, add_missing=True, show_bin_labels=False, savefig=None, figsize=None)¶

Plot the binning table.

Visualize event count and event rate values for each class.

Parameters

add_special (bool (default=True)) – Whether to add the special codes bin.
add_missing (bool (default=True)) – Whether to add the missing values bin.
show_bin_labels (bool (default=False)) –
Whether to show the bin label instead of the bin id on the x-axis. For long labels (length > 27), labels are truncated.

New in version 0.15.1.
savefig (str or None (default=None)) – Path to save the plot figure.
figsize (tuple or None (default=None)) – Size of the plot.

property quality_score¶

The quality score (QS).

The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.

Returns: quality_score
Return type: float