Binning 2D tables

Binning table 2D: binary target

class optbinning.binning.multidimensional.binning_statistics_2d.BinningTable2D(name_x, name_y, dtype_x, dtype_y, splits_x, splits_y, m, n, n_nonevent, n_event, D, P, categories_x=None, categories_y=None)

Bases: optbinning.binning.binning_statistics.BinningTable

Binning table to summarize optimal binning of two numerical variables with respect to a binary target.

Parameters

name_x (str, optional (default="")) – The name of variable x.
name_y (str, optional (default="")) – The name of variable y.
dtype_x (str, optional (default="numerical")) – The data type of variable x. Supported data type is “numerical” for continuous and ordinal variables.
dtype_y (str, optional (default="numerical")) – The data type of variable y. Supported data type is “numerical” for continuous and ordinal variables.
splits_x (numpy.ndarray) – List of split points for variable x.
splits_y (numpy.ndarray) – List of split points for variable y.
m (int) – Number of rows of the 2D array.
n (int) – Number of columns of the 2D array.
n_nonevent (numpy.ndarray) – Number of non-events.
n_event (numpy.ndarray) – Number of events.
D (numpy.ndarray) – Event rate 2D array.
P (numpy.ndarray) – Bin indices 2D array.
categories_x (numpy.ndarray or None (default=None)) – List of categories in variable x.
categories_y (numpy.ndarray or None (default=None)) – List of categories in variable y.

Warning

This class is not intended to be instantiated by the user. Instead, use the binning table object returned by the binning_table property, which is available in all optimal binning classes.

analysis(pvalue_test='chi2', n_samples=100, print_output=True)

Binning table analysis.

Statistical analysis of the binning table, computing the Gini index, Information Value (IV), Jensen-Shannon divergence, and the quality score. Additionally, several statistical significance tests between consecutive bins of the contingency table are performed: a frequentist test using the Chi-square test or Fisher’s exact test, and a Bayesian A/B test using the beta distribution as a conjugate prior of the Bernoulli distribution.

Parameters

pvalue_test (str, optional (default="chi2")) – The statistical test. Supported tests are “chi2” for the Chi-square test and “fisher” for Fisher’s exact test.
n_samples (int, optional (default=100)) – The number of samples to run the Bayesian A/B testing between consecutive bins to compute the probability of the event rate of bin A being greater than the event rate of bin B.
print_output (bool (default=True)) – Whether to print analysis information.
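The Bayesian A/B comparison that n_samples controls can be sketched with the standard library alone. This is an illustration of the idea rather than the library's implementation: the uniform Beta(1, 1) prior, the function name prob_a_greater_b, and the counts are all assumptions made for the example.

```python
import random


def prob_a_greater_b(event_a, nonevent_a, event_b, nonevent_b,
                     n_samples=100, seed=0):
    """Monte Carlo estimate of P(event rate of A > event rate of B).

    Assumes a uniform Beta(1, 1) prior on each bin's event rate, so each
    posterior is Beta(1 + events, 1 + non-events).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        rate_a = rng.betavariate(1 + event_a, 1 + nonevent_a)
        rate_b = rng.betavariate(1 + event_b, 1 + nonevent_b)
        wins += rate_a > rate_b
    return wins / n_samples


# Hypothetical counts for two consecutive bins.
p = prob_a_greater_b(event_a=60, nonevent_a=140,
                     event_b=30, nonevent_b=170, n_samples=2000)
```

A larger n_samples reduces the Monte Carlo noise of the estimate at the cost of more sampling.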

Notes

The Chi-square test uses scipy.stats.chi2_contingency, and the Fisher exact test uses scipy.stats.fisher_exact.
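To make the frequentist comparison concrete, the sketch below computes the Pearson Chi-square statistic and its p-value for the 2x2 contingency table formed by two consecutive bins, using only the math module. It applies no continuity correction, whereas scipy.stats.chi2_contingency applies Yates' correction by default for 2x2 tables, so the numbers may differ from those reported by the library. The function name and counts are hypothetical.

```python
import math


def chi2_2x2(event_a, nonevent_a, event_b, nonevent_b):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 degree
    of freedom), without continuity correction."""
    observed = [[nonevent_a, event_a], [nonevent_b, event_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts for two consecutive bins.
stat, p_value = chi2_2x2(event_a=60, nonevent_a=140,
                         event_b=30, nonevent_b=170)
```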

build(show_digits=2, show_bin_xy=False, add_totals=True)

Build the binning table.

Parameters

show_digits (int, optional (default=2)) – The number of significant digits of the bin column.
show_bin_xy (bool (default=False)) – Whether to show a single bin column with x and y.
add_totals (bool (default=True)) – Whether to add a last row with totals.

Returns: binning_table
Return type: pandas.DataFrame

property gini

The Gini coefficient or Accuracy Ratio.

The Gini coefficient is a quantitative measure of the discriminatory and predictive power of a variable. The Gini coefficient ranges from 0 to 1.

Returns: gini
Return type: float
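As a rough illustration of what the Gini coefficient measures, the sketch below derives it from per-bin counts through the identity Gini = 2 * AUC - 1, scoring every record by the event rate of its bin. This is an assumption about one reasonable way to compute it, not necessarily the formula the library uses. The function name and counts are hypothetical, and every bin is assumed to contain at least one record.

```python
def gini_from_bins(n_event, n_nonevent):
    """Gini = 2 * AUC - 1, scoring every record by its bin's event rate."""
    total_event = sum(n_event)
    total_nonevent = sum(n_nonevent)
    rates = [e / (e + ne) for e, ne in zip(n_event, n_nonevent)]

    concordant = 0.0
    for rate_i, e_i in zip(rates, n_event):
        for rate_j, ne_j in zip(rates, n_nonevent):
            if rate_i > rate_j:
                concordant += e_i * ne_j        # event ranked above non-event
            elif rate_i == rate_j:
                concordant += 0.5 * e_i * ne_j  # a tie counts as one half
    auc = concordant / (total_event * total_nonevent)
    return 2.0 * auc - 1.0


# Hypothetical per-bin counts, flattened from the 2D table.
g = gini_from_bins(n_event=[10, 20, 40], n_nonevent=[90, 60, 30])
```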

property hellinger

The Hellinger divergence.

Returns: hellinger
Return type: float
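The formula is not given here, so the following is only a sketch under an assumed definition: the squared Hellinger distance between the per-bin event and non-event distributions. The library's exact scaling may differ. The function name and counts are hypothetical.

```python
import math


def hellinger_divergence(p, q):
    """0.5 * sum((sqrt(p_i) - sqrt(q_i))**2) for two discrete distributions.

    Assumed definition (squared Hellinger distance); the library's scaling
    may differ.
    """
    return 0.5 * sum((math.sqrt(p_i) - math.sqrt(q_i)) ** 2
                     for p_i, q_i in zip(p, q))


# Hypothetical per-bin counts, flattened from the 2D table.
n_event = [10, 20, 40]
n_nonevent = [90, 60, 30]
p = [e / sum(n_event) for e in n_event]           # event distribution
q = [ne / sum(n_nonevent) for ne in n_nonevent]   # non-event distribution
h = hellinger_divergence(p, q)
```

Under this definition the value is 0 for identical distributions and 1 for distributions with disjoint support.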

property iv

The Information Value (IV), also known as the Jeffreys divergence measure.

The IV ranges from 0 to Infinity.

Returns: iv
Return type: float
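A minimal sketch of how the IV follows from per-bin counts, using the standard Weight of Evidence definition. The sign convention WoE = log(non-event share / event share) is an assumption; because the IV is symmetric in the two distributions, the convention does not change its value. The function name and counts are hypothetical, and each bin is assumed to contain at least one event and one non-event.

```python
import math


def woe_iv(n_event, n_nonevent):
    """Return per-bin WoE values and the total Information Value."""
    total_event = sum(n_event)
    total_nonevent = sum(n_nonevent)
    woe, iv = [], 0.0
    for e, ne in zip(n_event, n_nonevent):
        p_event = e / total_event            # share of all events in the bin
        p_nonevent = ne / total_nonevent     # share of all non-events
        w = math.log(p_nonevent / p_event)
        woe.append(w)
        iv += (p_nonevent - p_event) * w     # every term is non-negative
    return woe, iv


# Hypothetical per-bin counts, flattened from the 2D table.
woe, iv = woe_iv(n_event=[10, 20, 40], n_nonevent=[90, 60, 30])
```

Each term is a product of two factors with the same sign, which is why the IV is never negative and grows without bound as the two distributions separate.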

property js

The Jensen-Shannon divergence measure (JS).

The JS ranges from 0 to \(\log(2)\).

Returns: js
Return type: float
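The upper bound of \(\log(2)\) can be checked with a small sketch of the standard Jensen-Shannon divergence between two discrete distributions, computed with natural logarithms. The function name and inputs are hypothetical, and this is not necessarily how the library evaluates it.

```python
import math


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    js = 0.0
    for p_i, q_i in zip(p, q):
        m_i = 0.5 * (p_i + q_i)              # mixture distribution
        if p_i > 0:
            js += 0.5 * p_i * math.log(p_i / m_i)
        if q_i > 0:
            js += 0.5 * q_i * math.log(q_i / m_i)
    return js


# Distributions with disjoint support reach the upper bound log(2).
js_max = jensen_shannon([1.0, 0.0], [0.0, 1.0])
# Identical distributions give zero.
js_min = jensen_shannon([0.3, 0.7], [0.3, 0.7])
```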

property ks

The Kolmogorov-Smirnov statistic.

Returns: ks
Return type: float
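One way to picture the Kolmogorov-Smirnov statistic for binned data is as the largest gap between the cumulative event and non-event distributions. The sketch below orders bins by event rate before accumulating, which is an assumption about the ordering and may not match the library. The function name and counts are hypothetical, and every bin is assumed to contain at least one record.

```python
def ks_from_bins(n_event, n_nonevent):
    """Maximum absolute gap between cumulative event and non-event shares,
    with bins ordered by increasing event rate (assumed ordering)."""
    total_event = sum(n_event)
    total_nonevent = sum(n_nonevent)
    bins = sorted(zip(n_event, n_nonevent),
                  key=lambda b: b[0] / (b[0] + b[1]))

    cum_event = cum_nonevent = 0.0
    ks = 0.0
    for e, ne in bins:
        cum_event += e / total_event
        cum_nonevent += ne / total_nonevent
        ks = max(ks, abs(cum_event - cum_nonevent))
    return ks


# Hypothetical per-bin counts, flattened from the 2D table.
ks = ks_from_bins(n_event=[10, 20, 40], n_nonevent=[90, 60, 30])
```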

plot(metric='woe', savefig=None, save_kwargs=None)

Plot the binning table.

Visualize the Weight of Evidence or the event rate for each bin as a matrix, and the x and y trend.

Parameters

metric (str, optional (default="woe")) – Supported metrics are “woe” to show the Weight of Evidence (WoE) measure and “event_rate” to show the event rate.
savefig (str or None (default=None)) – Path to save the plot figure.
save_kwargs (dict or None (default=None)) – Additional keyword arguments to be passed to plt.savefig.

property quality_score

The quality score (QS).

The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.

Returns: quality_score
Return type: float

property triangular

The triangular divergence.

Returns: triangular
Return type: float
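The formula is not stated here. As a sketch under an assumed definition, the triangular discrimination between the per-bin event and non-event distributions can be written as the sum of (p_i - q_i)^2 / (p_i + q_i). The library may define or scale it differently, and the function name and inputs are hypothetical.

```python
def triangular_divergence(p, q):
    """Sum of (p_i - q_i)**2 / (p_i + q_i) over bins with nonzero mass.

    Assumed definition (triangular discrimination); the library's exact
    formula may differ.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i + q_i > 0:
            total += (p_i - q_i) ** 2 / (p_i + q_i)
    return total


# Hypothetical per-bin counts, flattened from the 2D table.
n_event = [10, 20, 40]
n_nonevent = [90, 60, 30]
p = [e / sum(n_event) for e in n_event]           # event distribution
q = [ne / sum(n_nonevent) for ne in n_nonevent]   # non-event distribution
t = triangular_divergence(p, q)
```

Under this definition the value ranges from 0 for identical distributions to 2 for distributions with disjoint support.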

Binning table 2D: continuous target

class optbinning.binning.multidimensional.binning_statistics_2d.ContinuousBinningTable2D(name_x, name_y, dtype_x, dtype_y, splits_x, splits_y, m, n, n_records, sums, stds, D, P, categories_x=None, categories_y=None)

Bases: optbinning.binning.binning_statistics.ContinuousBinningTable

Binning table to summarize optimal binning of two numerical variables with respect to a continuous target.

Parameters

name_x (str, optional (default="")) – The name of variable x.
name_y (str, optional (default="")) – The name of variable y.
dtype_x (str, optional (default="numerical")) – The data type of variable x. Supported data type is “numerical” for continuous and ordinal variables.
dtype_y (str, optional (default="numerical")) – The data type of variable y. Supported data type is “numerical” for continuous and ordinal variables.
splits_x (numpy.ndarray) – List of split points for variable x.
splits_y (numpy.ndarray) – List of split points for variable y.
m (int) – Number of rows of the 2D array.
n (int) – Number of columns of the 2D array.
n_records (numpy.ndarray) – Number of records.
sums (numpy.ndarray) – Target sums.
stds (numpy.ndarray) – Target stds.
D (numpy.ndarray) – Mean 2D array.
P (numpy.ndarray) – Bin indices 2D array.
categories_x (numpy.ndarray or None (default=None)) – List of categories in variable x.
categories_y (numpy.ndarray or None (default=None)) – List of categories in variable y.

Warning

This class is not intended to be instantiated by the user. Instead, use the binning table object returned by the binning_table property, which is available in all optimal binning classes.

analysis(print_output=True)

Binning table analysis.

Statistical analysis of the binning table, computing the Information Value (IV) and Herfindahl-Hirschman Index (HHI).

Parameters: print_output (bool (default=True)) – Whether to print analysis information.

Notes

The IV for a continuous target is computed as follows:

\[IV = \sum_{i=1}^n |U_i - \mu| \frac{r_i}{r_T},\]

where \(U_i\) is the target mean value for each bin, \(\mu\) is the total target mean, \(r_i\) is the number of records for each bin, and \(r_T\) is the total number of records.
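The IV formula above can be evaluated directly from per-bin record counts and target sums. The sketch below also computes the Herfindahl-Hirschman Index using its standard definition, the sum of squared record shares, which is an assumption since the formula is not stated here. The function names and numbers are hypothetical.

```python
def continuous_iv(n_records, sums):
    """IV = sum(|U_i - mu| * r_i / r_T) over bins."""
    total_records = sum(n_records)
    mu = sum(sums) / total_records            # total target mean
    iv = 0.0
    for r_i, s_i in zip(n_records, sums):
        u_i = s_i / r_i                       # target mean of the bin
        iv += abs(u_i - mu) * r_i / total_records
    return iv


def hhi(n_records):
    """Sum of squared record shares (standard definition, assumed here)."""
    total_records = sum(n_records)
    return sum((r_i / total_records) ** 2 for r_i in n_records)


# Hypothetical bins with target means 10, 15 and 20.
n_records = [50, 30, 20]
sums = [500.0, 450.0, 400.0]
iv = continuous_iv(n_records, sums)
concentration = hhi(n_records)
```

With these numbers the total mean is 13.5, so the absolute deviations 3.5, 1.5 and 6.5 are weighted by the record shares 0.5, 0.3 and 0.2.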

build(show_digits=2, show_bin_xy=False, add_totals=True)

Build the binning table.

Parameters

show_digits (int, optional (default=2)) – The number of significant digits of the bin column.
show_bin_xy (bool (default=False)) – Whether to show a single bin column with x and y.
add_totals (bool (default=True)) – Whether to add a last row with totals.

Returns: binning_table
Return type: pandas.DataFrame

property iv

The Information Value (IV).

The IV ranges from 0 to Infinity.

Returns: iv
Return type: float

plot(savefig=None)

Plot the binning table.

Visualize the mean for each bin as a matrix, and the x and y trend.

Parameters: savefig (str or None (default=None)) – Path to save the plot figure.

property quality_score

The quality score (QS).

The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.

Returns: quality_score
Return type: float

property woe

The sum of absolute WoEs.

This metric is computed as follows:

\[WoE = \sum_{i=1}^n |U_i - \mu|,\]

where \(U_i\) is the target mean value for each bin and \(\mu\) is the total target mean.

Returns: woe
Return type: float
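A short sketch of the sum of absolute WoEs as defined by the formula above, computed from per-bin record counts and target sums. The function name and numbers are hypothetical.

```python
def sum_abs_woe(n_records, sums):
    """WoE = sum(|U_i - mu|) over bins, per the formula above."""
    mu = sum(sums) / sum(n_records)           # total target mean
    return sum(abs(s_i / r_i - mu) for r_i, s_i in zip(n_records, sums))


# Hypothetical bins with target means 10, 15 and 20 (total mean 13.5).
woe_total = sum_abs_woe(n_records=[50, 30, 20], sums=[500.0, 450.0, 400.0])
```

Unlike the IV for a continuous target, this metric does not weight each deviation by the bin's share of records.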