Binning tables

Binning table: binary target

class optbinning.binning.binning_statistics.BinningTable(name, dtype, special_codes, splits, n_nonevent, n_event, min_x=None, max_x=None, categories=None, cat_others=None, user_splits=None)

Bases: object

Binning table to summarize optimal binning of a numerical or categorical variable with respect to a binary target.

Parameters
  • name (str, optional (default="")) – The variable name.

  • dtype (str, optional (default="numerical")) – The variable data type. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.

  • special_codes (array-like, dict or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.

  • splits (numpy.ndarray) – List of split points.

  • n_nonevent (numpy.ndarray) – Number of non-events.

  • n_event (numpy.ndarray) – Number of events.

  • min_x (float or None (default=None)) – Minimum value of x.

  • max_x (float or None (default=None)) – Maximum value of x.

  • categories (list, numpy.ndarray or None, optional (default=None)) – List of categories.

  • cat_others (list, numpy.ndarray or None, optional (default=None)) – List of categories in the “others” bin.

  • user_splits (numpy.ndarray) – List of user-defined split points, passed when pre-bins were supplied by the user.

Warning

This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table available in all optimal binning classes.

analysis(pvalue_test='chi2', n_samples=100, print_output=True)

Binning table analysis.

Statistical analysis of the binning table, computing the statistics Gini index, Information Value (IV), Jensen-Shannon divergence, and the quality score. Additionally, several statistical significance tests between consecutive bins of the contingency table are performed: a frequentist test using the Chi-square test or the Fisher’s exact test, and a Bayesian A/B test using the beta distribution as a conjugate prior of the Bernoulli distribution.

Parameters
  • pvalue_test (str, optional (default="chi2")) – The statistical test. Supported tests are “chi2” to choose the Chi-square test and “fisher” to choose Fisher’s exact test.

  • n_samples (int, optional (default=100)) – The number of samples to run the Bayesian A/B testing between consecutive bins to compute the probability of the event rate of bin A being greater than the event rate of bin B.

  • print_output (bool (default=True)) – Whether to print analysis information.

Notes

The Chi-square test uses scipy.stats.chi2_contingency, and the Fisher exact test uses scipy.stats.fisher_exact.
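The Bayesian A/B test between two consecutive bins can be illustrated with a small Monte Carlo sketch. This is not the library's implementation: the function name `prob_a_greater_b`, the uniform Beta(1, 1) prior, and the fixed seed are assumptions made for the example. It draws `n_samples` event rates from each bin's Beta posterior and reports the fraction of draws in which bin A's rate exceeds bin B's.

```python
import random


def prob_a_greater_b(event_a, nonevent_a, event_b, nonevent_b,
                     n_samples=100, seed=0):
    """Monte Carlo estimate of P(event rate of A > event rate of B)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        # Beta posterior of a Bernoulli rate under an assumed Beta(1, 1) prior.
        rate_a = rng.betavariate(1 + event_a, 1 + nonevent_a)
        rate_b = rng.betavariate(1 + event_b, 1 + nonevent_b)
        wins += rate_a > rate_b
    return wins / n_samples


# Bin A has a much higher event rate than bin B.
p_clear = prob_a_greater_b(900, 100, 100, 900)
# Two bins with identical counts: the estimate should be near 0.5.
p_tie = prob_a_greater_b(50, 50, 50, 50, n_samples=1000)
```

A larger `n_samples` reduces the Monte Carlo noise of the estimate at the cost of more sampling.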

build(show_digits=2, add_totals=True)

Build the binning table.

Parameters
  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column.

  • add_totals (bool (default=True)) – Whether to add a last row with totals.

Returns

binning_table

Return type

pandas.DataFrame

property gini

The Gini coefficient or Accuracy Ratio.

The Gini coefficient is a quantitative measure of the discriminatory and predictive power of a variable. The Gini coefficient ranges from 0 to 1.

Returns

gini

Return type

float
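As an illustration of how a Gini coefficient can be obtained from binned counts, the sketch below ranks bins by event rate, builds the resulting ROC curve, and uses Gini = 2·AUC − 1. The helper `gini_from_bins` is hypothetical and assumes every bin is non-empty; the library may compute the value differently.

```python
def gini_from_bins(n_event, n_nonevent):
    """Gini coefficient (2 * AUC - 1) from per-bin event/non-event counts."""
    total_event = sum(n_event)
    total_nonevent = sum(n_nonevent)
    # Rank bins from the highest to the lowest event rate.
    bins = sorted(zip(n_event, n_nonevent),
                  key=lambda b: b[0] / (b[0] + b[1]), reverse=True)
    auc = 0.0
    tpr = fpr = 0.0
    for e, ne in bins:
        new_tpr = tpr + e / total_event
        new_fpr = fpr + ne / total_nonevent
        # Trapezoid under the ROC segment contributed by this bin.
        auc += (new_fpr - fpr) * (new_tpr + tpr) / 2
        tpr, fpr = new_tpr, new_fpr
    return 2 * auc - 1


gini_perfect = gini_from_bins([10, 0], [0, 10])   # bins separate the classes
gini_none = gini_from_bins([5, 5], [5, 5])        # bins carry no information
gini_partial = gini_from_bins([30, 10], [70, 90])
```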

property hellinger

The Hellinger divergence.

Returns

hellinger

Return type

float

property iv

The Information Value (IV) or Jeffreys’ divergence measure.

The IV ranges from 0 to Infinity.

Returns

iv

Return type

float
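A minimal sketch of the usual WoE and IV computation from per-bin counts follows. The function `woe_and_iv` is hypothetical, the sign convention (log of the non-event share over the event share) is an assumption, and all counts are assumed strictly positive so that the logarithm is defined. The IV itself does not depend on the sign convention.

```python
import math


def woe_and_iv(n_event, n_nonevent):
    """Per-bin Weight of Evidence and the total Information Value."""
    total_event = sum(n_event)
    total_nonevent = sum(n_nonevent)
    woe = []
    iv = 0.0
    for e, ne in zip(n_event, n_nonevent):
        p_event = e / total_event            # share of all events in this bin
        p_nonevent = ne / total_nonevent     # share of all non-events in this bin
        w = math.log(p_nonevent / p_event)   # assumed sign convention
        woe.append(w)
        iv += (p_nonevent - p_event) * w     # each term is non-negative
    return woe, iv


woe, iv = woe_and_iv([10, 30], [90, 70])
_, iv_flat = woe_and_iv([10, 20], [30, 60])  # same distribution in both classes
```

Each term of the sum is non-negative, which is why the IV has a lower bound of 0 and grows without bound as some bin becomes almost pure.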

property js

The Jensen-Shannon divergence measure (JS).

The JS ranges from 0 to \(\log(2)\).

Returns

js

Return type

float
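The upper bound \(\log(2)\) can be checked with a short sketch of the Jensen-Shannon divergence between two discrete distributions, for example the distributions of events and non-events across bins. The function name `jensen_shannon` is hypothetical, and natural logarithms are assumed.

```python
import math


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a zero weight contribute 0.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


js_max = jensen_shannon([1.0, 0.0], [0.0, 1.0])   # disjoint supports
js_min = jensen_shannon([0.25, 0.75], [0.25, 0.75])
```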

property ks

The Kolmogorov-Smirnov statistic.

Returns

ks

Return type

float
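For binned data, a Kolmogorov-Smirnov statistic can be sketched as the largest absolute gap between the cumulative distributions of events and non-events, taken over bins in table order. The helper `ks_from_bins` is hypothetical and only illustrates the definition.

```python
def ks_from_bins(n_event, n_nonevent):
    """Maximum gap between cumulative event and non-event distributions."""
    total_event = sum(n_event)
    total_nonevent = sum(n_nonevent)
    cum_event = cum_nonevent = 0.0
    ks = 0.0
    for e, ne in zip(n_event, n_nonevent):
        cum_event += e / total_event
        cum_nonevent += ne / total_nonevent
        ks = max(ks, abs(cum_event - cum_nonevent))
    return ks


ks_perfect = ks_from_bins([0, 10], [10, 0])
ks_partial = ks_from_bins([30, 10], [70, 90])
```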

plot(metric='woe', add_special=True, add_missing=True, style='bin', show_bin_labels=False, savefig=None, figsize=None)

Plot the binning table.

Visualize the non-event and event count, and the Weight of Evidence or the event rate for each bin.

Parameters
  • metric (str, optional (default="woe")) – Supported metrics are “woe” to show the Weight of Evidence (WoE) measure and “event_rate” to show the event rate.

  • add_special (bool (default=True)) – Whether to add the special codes bin.

  • add_missing (bool (default=True)) – Whether to add the missing values bin.

  • style (str, optional (default="bin")) – Plot style. style=“bin” shows the standard binning plot. style=“actual” shows the plot on the actual scale, i.e., with the actual bin widths.

  • show_bin_labels (bool (default=False)) –

    Whether to show the bin label instead of the bin id on the x-axis. For long labels (length > 27), labels are truncated.

    New in version 0.15.1.

  • savefig (str or None (default=None)) – Path to save the plot figure.

  • figsize (tuple or None (default=None)) – Size of the plot.

property quality_score

The quality score (QS).

The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.

Returns

quality_score

Return type

float

property triangular

The triangular divergence.

Returns

triangular

Return type

float

Binning table: continuous target

class optbinning.binning.binning_statistics.ContinuousBinningTable(name, dtype, special_codes, splits, n_records, sums, stds, min_target, max_target, n_zeros, min_x=None, max_x=None, categories=None, cat_others=None, user_splits=None)

Bases: object

Binning table to summarize optimal binning of a numerical or categorical variable with respect to a continuous target.

Parameters
  • name (str, optional (default="")) – The variable name.

  • dtype (str, optional (default="numerical")) – The variable data type. Supported data types are “numerical” for continuous and ordinal variables and “categorical” for categorical and nominal variables.

  • special_codes (array-like, dict or None, optional (default=None)) – List of special codes. Use special codes to specify the data values that must be treated separately.

  • splits (numpy.ndarray) – List of split points.

  • n_records (numpy.ndarray) – Number of records.

  • sums (numpy.ndarray) – Target sums.

  • stds (numpy.ndarray) – Target stds.

  • min_target (numpy.ndarray) – Target minimum values.

  • max_target (numpy.ndarray) – Target maximum values.

  • n_zeros (numpy.ndarray) – Number of zeros.

  • min_x (float or None (default=None)) – Minimum value of x.

  • max_x (float or None (default=None)) – Maximum value of x.

  • categories (list, numpy.ndarray or None, optional (default=None)) – List of categories.

  • cat_others (list, numpy.ndarray or None, optional (default=None)) – List of categories in the “others” bin.

  • user_splits (numpy.ndarray) – List of user-defined split points, passed when pre-bins were supplied by the user.

Warning

This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table available in all optimal binning classes.

analysis(print_output=True)

Binning table analysis.

Statistical analysis of the binning table, computing the Information Value (IV) and Herfindahl-Hirschman Index (HHI).

Parameters

  • print_output (bool (default=True)) – Whether to print analysis information.

Notes

The IV for a continuous target is computed as follows:

\[IV = \sum_{i=1}^n |U_i - \mu| \frac{r_i}{r_T},\]

where \(U_i\) is the target mean value for each bin, \(\mu\) is the total target mean, \(r_i\) is the number of records for each bin, and \(r_T\) is the total number of records.
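The formula above can be evaluated directly from the per-bin record counts and target sums. The sketch below uses a hypothetical helper, `continuous_iv`, and assumes every bin contains at least one record.

```python
def continuous_iv(n_records, sums):
    """IV for a continuous target: sum_i |U_i - mu| * r_i / r_T."""
    total_records = sum(n_records)       # r_T
    mu = sum(sums) / total_records       # total target mean
    iv = 0.0
    for r, s in zip(n_records, sums):
        bin_mean = s / r                 # U_i
        iv += abs(bin_mean - mu) * r / total_records
    return iv


# Two bins with means 2 and 4; the total mean is 140 / 40 = 3.5.
iv_value = continuous_iv([10, 30], [20.0, 120.0])
```

Here the two terms are 1.5 · 10/40 and 0.5 · 30/40, so `iv_value` is 0.75.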

build(show_digits=2, add_totals=True)

Build the binning table.

Parameters
  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column.

  • add_totals (bool (default=True)) – Whether to add a last row with totals.

Returns

binning_table

Return type

pandas.DataFrame

property iv

The Information Value (IV).

The IV ranges from 0 to Infinity.

Returns

iv

Return type

float

plot(add_special=True, add_missing=True, style='bin', show_bin_labels=False, savefig=None, figsize=None, metric='mean')

Plot the binning table.

Visualize records count and mean values.

Parameters
  • metric (str, optional (default="mean")) –

    Supported metrics are “mean” to show the mean value of the target variable in each bin, “iv” to show the IV of each bin, and “woe” to show the Weight of Evidence (WoE) of each bin.

    New in version 0.19.0.

  • add_special (bool (default=True)) – Whether to add the special codes bin.

  • add_missing (bool (default=True)) – Whether to add the missing values bin.

  • style (str, optional (default="bin")) – Plot style. style=“bin” shows the standard binning plot. style=“actual” shows the plot on the actual scale, i.e., with the actual bin widths.

  • show_bin_labels (bool (default=False)) –

    Whether to show the bin label instead of the bin id on the x-axis. For long labels (length > 27), labels are truncated.

    New in version 0.15.1.

  • savefig (str or None (default=None)) – Path to save the plot figure.

  • figsize (tuple or None (default=None)) – Size of the plot.

property quality_score

The quality score (QS).

The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.

Returns

quality_score

Return type

float

property woe

The sum of absolute WoEs.

This metric is computed as follows:

\[WoE = \sum_{i=1}^n |U_i - \mu|,\]

where \(U_i\) is the target mean value for each bin, \(\mu\) is the total target mean.

Returns

woe

Return type

float
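A direct evaluation of this unweighted sum is sketched below with a hypothetical helper, `sum_abs_woe`, which takes per-bin record counts and target sums and assumes non-empty bins. Unlike the continuous-target IV, the terms are not weighted by the bin's share of records.

```python
def sum_abs_woe(n_records, sums):
    """Sum of absolute deviations of bin means from the total mean."""
    mu = sum(sums) / sum(n_records)      # total target mean
    return sum(abs(s / r - mu) for r, s in zip(n_records, sums))


# Bin means are 2 and 4, the total mean is 3.5, so the result is 1.5 + 0.5.
woe_value = sum_abs_woe([10, 30], [20.0, 120.0])
```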

Binning table: multiclass target

class optbinning.binning.binning_statistics.MulticlassBinningTable(name, special_codes, splits, n_event, classes)

Bases: object

Binning table to summarize optimal binning of a numerical variable with respect to a multiclass or multilabel target.

Parameters
  • name (str, optional (default="")) – The variable name.

  • splits (numpy.ndarray) – List of split points.

  • n_event (numpy.ndarray) – Number of events.

  • classes (array-like) – List of classes.

Warning

This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table available in all optimal binning classes.

analysis(print_output=True)

Binning table analysis.

Statistical analysis of the binning table, computing the Jensen-Shannon divergence and the quality score. Additionally, a statistical significance test between consecutive bins of the contingency table is performed using the Chi-square test.

Parameters

  • print_output (bool (default=True)) – Whether to print analysis information.

Notes

The Chi-square test uses scipy.stats.chi2_contingency.
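To show what is being tested, the sketch below computes the Pearson Chi-square statistic for a contingency table whose rows are two consecutive bins and whose columns are the per-class event counts. The helper `chi2_statistic` is hypothetical and returns only the statistic, without continuity correction; the p-value additionally needs the Chi-square survival function, which scipy.stats.chi2_contingency provides.

```python
def chi2_statistic(table):
    """Pearson Chi-square statistic of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of bin and class.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Two consecutive bins with identical class counts: no evidence of a difference.
stat_same = chi2_statistic([[10, 20, 30], [10, 20, 30]])
# Two bins whose class counts are swapped.
stat_diff = chi2_statistic([[10, 20], [20, 10]])
```

Note that scipy.stats.chi2_contingency applies Yates' continuity correction by default when the table has one degree of freedom (a 2×2 table), so its statistic for such a table will differ from the uncorrected value computed here.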

build(show_digits=2, add_totals=True)

Build the binning table.

Parameters
  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column.

  • add_totals (bool (default=True)) – Whether to add a last row with totals.

Returns

binning_table

Return type

pandas.DataFrame

property js

The Jensen-Shannon divergence measure (JS).

The JS ranges from 0 to \(\log(n_{classes})\).

Returns

js

Return type

float
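The bound \(\log(n_{classes})\) is consistent with the generalized Jensen-Shannon divergence of several distributions, defined as the entropy of their mixture minus the mean of their entropies. The sketch below assumes uniform weights and natural logarithms; the names `entropy` and `generalized_js` are hypothetical, and the library's exact weighting is not stated here.

```python
import math


def entropy(p):
    """Shannon entropy of a discrete distribution (natural logarithm)."""
    return -sum(x * math.log(x) for x in p if x > 0)


def generalized_js(distributions):
    """Generalized JS divergence with uniform weights (an assumption)."""
    n = len(distributions)
    mixture = [sum(col) / n for col in zip(*distributions)]
    return entropy(mixture) - sum(entropy(d) for d in distributions) / n


# Three classes concentrated in three different bins reach the upper bound.
js_max = generalized_js([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
# Three classes with the same distribution across bins give zero divergence.
js_min = generalized_js([[0.2, 0.8], [0.2, 0.8], [0.2, 0.8]])
```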

plot(add_special=True, add_missing=True, show_bin_labels=False, savefig=None, figsize=None)

Plot the binning table.

Visualize event count and event rate values for each class.

Parameters
  • add_special (bool (default=True)) – Whether to add the special codes bin.

  • add_missing (bool (default=True)) – Whether to add the missing values bin.

  • show_bin_labels (bool (default=False)) –

    Whether to show the bin label instead of the bin id on the x-axis. For long labels (length > 27), labels are truncated.

    New in version 0.15.1.

  • savefig (str or None (default=None)) – Path to save the plot figure.

  • figsize (tuple or None (default=None)) – Size of the plot.

property quality_score

The quality score (QS).

The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.

Returns

quality_score

Return type

float