Binning 2D tables

Binning table 2D: binary target

class optbinning.binning.multidimensional.binning_statistics_2d.BinningTable2D(name_x, name_y, dtype_x, dtype_y, splits_x, splits_y, m, n, n_nonevent, n_event, D, P)

Bases: optbinning.binning.binning_statistics.BinningTable

Binning table to summarize optimal binning of two numerical variables with respect to a binary target.

Parameters
  • name_x (str, optional (default="")) – The name of variable x.

  • name_y (str, optional (default="")) – The name of variable y.

  • dtype_x (str, optional (default="numerical")) – The data type of variable x. Supported data type is “numerical” for continuous and ordinal variables.

  • dtype_y (str, optional (default="numerical")) – The data type of variable y. Supported data type is “numerical” for continuous and ordinal variables.

  • splits_x (numpy.ndarray) – List of split points for variable x.

  • splits_y (numpy.ndarray) – List of split points for variable y.

  • m (int) – Number of rows of the 2D array.

  • n (int) – Number of columns of the 2D array.

  • n_nonevent (numpy.ndarray) – Number of non-events.

  • n_event (numpy.ndarray) – Number of events.

  • D (numpy.ndarray) – Event rate 2D array.

  • P (numpy.ndarray) – Bin indices 2D array.

Warning

This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table available in all optimal binning classes.

analysis(pvalue_test='chi2', n_samples=100, print_output=True)

Binning table analysis.

Statistical analysis of the binning table, computing the Gini index, Information Value (IV), Jensen-Shannon divergence, and the quality score. Additionally, several statistical significance tests between consecutive bins of the contingency table are performed: a frequentist test using the Chi-square test or Fisher’s exact test, and a Bayesian A/B test using the beta distribution as a conjugate prior of the Bernoulli distribution.

Parameters
  • pvalue_test (str, optional (default="chi2")) – The statistical test. Supported tests are “chi2” to choose the Chi-square test and “fisher” to choose Fisher’s exact test.

  • n_samples (int, optional (default=100)) – The number of samples to run the Bayesian A/B testing between consecutive bins to compute the probability of the event rate of bin A being greater than the event rate of bin B.

  • print_output (bool (default=True)) – Whether to print analysis information.

Notes

The Chi-square test uses scipy.stats.chi2_contingency, and the Fisher exact test uses scipy.stats.fisher_exact.
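The Bayesian A/B test mentioned in the description of analysis can be illustrated with a minimal sketch. It assumes a uniform Beta(1, 1) prior and plain Monte Carlo sampling; the function name and the prior are illustrative assumptions, not the library's implementation, which may use a different prior or estimator.

```python
import random


def prob_a_greater_b(event_a, nonevent_a, event_b, nonevent_b,
                     n_samples=100, seed=0):
    """Monte Carlo estimate of P(event rate of bin A > event rate of bin B).

    Hypothetical helper: assumes a uniform Beta(1, 1) prior on each rate.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        # With a Beta(1, 1) prior and Bernoulli observations, the posterior
        # of the event rate is Beta(1 + events, 1 + non-events).
        rate_a = rng.betavariate(1 + event_a, 1 + nonevent_a)
        rate_b = rng.betavariate(1 + event_b, 1 + nonevent_b)
        wins += rate_a > rate_b
    return wins / n_samples


# Two consecutive bins with clearly different observed event rates
# (80/200 = 0.40 versus 30/200 = 0.15).
p = prob_a_greater_b(event_a=80, nonevent_a=120, event_b=30, nonevent_b=170)
```

A larger n_samples reduces the Monte Carlo noise of the estimate at the cost of more sampling.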

build(show_digits=2, show_bin_xy=False, add_totals=True)

Build the binning table.

Parameters
  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column.

  • show_bin_xy (bool (default=False)) – Whether to show a single bin column with x and y.

  • add_totals (bool (default=True)) – Whether to add a last row with totals.

Returns

binning_table

Return type

pandas.DataFrame

property gini

The Gini coefficient or Accuracy Ratio.

The Gini coefficient is a quantitative measure of the discriminatory and predictive power of a variable. The Gini coefficient ranges from 0 to 1.

Returns

gini

Return type

float

property hellinger

The Hellinger divergence.

Returns

hellinger

Return type

float

property iv

The Information Value (IV) or Jeffreys’ divergence measure.

The IV ranges from 0 to Infinity.

Returns

iv

Return type

float
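As an illustration of how an IV of this kind is obtained from per-bin counts, the sketch below uses the common definition based on the Weight of Evidence, WoE = log(share of non-events / share of events). The function name and the sample counts are hypothetical, and every bin is assumed to contain at least one event and one non-event.

```python
import math


def woe_and_iv(n_nonevent, n_event):
    """Return per-bin WoE values and the total IV.

    Hypothetical helper: assumes all counts are strictly positive.
    """
    total_nonevent = sum(n_nonevent)
    total_event = sum(n_event)
    woe = []
    iv = 0.0
    for ne, e in zip(n_nonevent, n_event):
        p_nonevent = ne / total_nonevent  # share of all non-events in this bin
        p_event = e / total_event         # share of all events in this bin
        w = math.log(p_nonevent / p_event)
        woe.append(w)
        # Each term is non-negative, so the IV is bounded below by 0.
        iv += (p_nonevent - p_event) * w
    return woe, iv


woe, iv = woe_and_iv([400, 300, 200, 100], [20, 30, 50, 100])

# Bins with identical event and non-event shares carry no information.
_, iv_zero = woe_and_iv([10, 20], [10, 20])
```

The IV grows without bound as the event and non-event distributions separate, which is why the upper end of its range is infinity.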

property js

The Jensen-Shannon divergence measure (JS).

The JS ranges from 0 to \(\log(2)\).

Returns

js

Return type

float
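The upper bound \(\log(2)\) can be checked with a small sketch of the standard Jensen-Shannon divergence using natural logarithms. Here p and q stand for the event and non-event distributions across bins; the function name is hypothetical and the code is not taken from the library.

```python
import math


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute 0.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    # Mixture distribution; m_i > 0 whenever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


js_same = jensen_shannon([0.5, 0.5], [0.5, 0.5])      # identical distributions
js_disjoint = jensen_shannon([1.0, 0.0], [0.0, 1.0])  # no overlap at all
```

Identical distributions give the lower bound 0, and distributions with no overlap reach the upper bound \(\log(2)\).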

property ks

The Kolmogorov-Smirnov statistic.

Returns

ks

Return type

float

plot(metric='woe', savefig=None)

Plot the binning table.

Visualize the Weight of Evidence or the event rate for each bin as a matrix, and the x and y trend.

Parameters
  • metric (str, optional (default="woe")) – Supported metrics are “woe” to show the Weight of Evidence (WoE) measure and “event_rate” to show the event rate.

  • savefig (str or None (default=None)) – Path to save the plot figure.

property quality_score

The quality score (QS).

The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.

Returns

quality_score

Return type

float

property triangular

The triangular divergence.

Returns

triangular

Return type

float

Binning table 2D: continuous target

class optbinning.binning.multidimensional.binning_statistics_2d.ContinuousBinningTable2D(name_x, name_y, dtype_x, dtype_y, splits_x, splits_y, m, n, n_records, sums, stds, D, P)

Bases: optbinning.binning.binning_statistics.ContinuousBinningTable

Binning table to summarize optimal binning of two numerical variables with respect to a continuous target.

Parameters
  • name_x (str, optional (default="")) – The name of variable x.

  • name_y (str, optional (default="")) – The name of variable y.

  • dtype_x (str, optional (default="numerical")) – The data type of variable x. Supported data type is “numerical” for continuous and ordinal variables.

  • dtype_y (str, optional (default="numerical")) – The data type of variable y. Supported data type is “numerical” for continuous and ordinal variables.

  • splits_x (numpy.ndarray) – List of split points for variable x.

  • splits_y (numpy.ndarray) – List of split points for variable y.

  • m (int) – Number of rows of the 2D array.

  • n (int) – Number of columns of the 2D array.

  • n_records (numpy.ndarray) – Number of records.

  • sums (numpy.ndarray) – Target sums.

  • stds (numpy.ndarray) – Target stds.

  • D (numpy.ndarray) – Mean 2D array.

  • P (numpy.ndarray) – Bin indices 2D array.

Warning

This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table available in all optimal binning classes.

analysis(print_output=True)

Binning table analysis.

Statistical analysis of the binning table, computing the Information Value (IV) and Herfindahl-Hirschman Index (HHI).

Parameters

  • print_output (bool (default=True)) – Whether to print analysis information.

Notes

The IV for a continuous target is computed as follows:

\[IV = \sum_{i=1}^n |U_i - \mu| \frac{r_i}{r_T},\]

where \(U_i\) is the target mean value for each bin, \(\mu\) is the total target mean, \(r_i\) is the number of records for each bin, and \(r_T\) is the total number of records.
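The formula above can be evaluated directly from per-bin means and record counts. The sketch below is a plain transcription of it; the function name and the sample values are hypothetical, and the total mean \(\mu\) is assumed to be the record-weighted mean of the bin means.

```python
def continuous_iv(means, n_records):
    """IV for a continuous target: sum of |U_i - mu| * r_i / r_T."""
    r_total = sum(n_records)
    # Total target mean, assumed here to be the record-weighted bin mean.
    mu = sum(u * r for u, r in zip(means, n_records)) / r_total
    return sum(abs(u - mu) * r / r_total for u, r in zip(means, n_records))


# Three bins: mu = (10*50 + 20*30 + 40*20) / 100 = 19
iv = continuous_iv(means=[10.0, 20.0, 40.0], n_records=[50, 30, 20])
# |10-19|*0.5 + |20-19|*0.3 + |40-19|*0.2 = 4.5 + 0.3 + 4.2 = 9.0
```

Because each deviation is weighted by the bin's share of records, sparsely populated bins with extreme means contribute less than their raw deviation suggests.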

build(show_digits=2, show_bin_xy=False, add_totals=True)

Build the binning table.

Parameters
  • show_digits (int, optional (default=2)) – The number of significant digits of the bin column.

  • show_bin_xy (bool (default=False)) – Whether to show a single bin column with x and y.

  • add_totals (bool (default=True)) – Whether to add a last row with totals.

Returns

binning_table

Return type

pandas.DataFrame

property iv

The Information Value (IV).

The IV ranges from 0 to Infinity.

Returns

iv

Return type

float

plot(savefig=None)

Plot the binning table.

Visualize the mean for each bin as a matrix, and the x and y trend.

Parameters

  • savefig (str or None (default=None)) – Path to save the plot figure.

property quality_score

The quality score (QS).

The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.

Returns

quality_score

Return type

float

property woe

The sum of absolute WoEs.

This metric is computed as follows:

\[WoE = \sum_{i=1}^n |U_i - \mu|,\]

where \(U_i\) is the target mean value for each bin and \(\mu\) is the total target mean.

Returns

woe

Return type

float
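For comparison with the IV, the sum of absolute WoEs defined above drops the record weights. A minimal sketch follows, with a hypothetical function name and sample values, and with \(\mu\) again assumed to be the record-weighted mean of the bin means.

```python
def sum_abs_woe(means, n_records):
    """Sum of absolute deviations of each bin mean from the total mean."""
    r_total = sum(n_records)
    # Total target mean, assumed here to be the record-weighted bin mean.
    mu = sum(u * r for u, r in zip(means, n_records)) / r_total
    return sum(abs(u - mu) for u in means)


# Three bins: mu = 19, so |10-19| + |20-19| + |40-19| = 9 + 1 + 21 = 31
woe_total = sum_abs_woe(means=[10.0, 20.0, 40.0], n_records=[50, 30, 20])
```

Unlike the IV, every bin contributes its full deviation here regardless of how many records it holds.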