Binning 2D tables
Binning table 2D: binary target
-
class optbinning.binning.multidimensional.binning_statistics_2d.BinningTable2D(name_x, name_y, dtype_x, dtype_y, splits_x, splits_y, m, n, n_nonevent, n_event, D, P, categories_x=None, categories_y=None)
Bases: optbinning.binning.binning_statistics.BinningTable
Binning table to summarize optimal binning of two numerical variables with respect to a binary target.
- Parameters
name_x (str, optional (default="")) – The name of variable x.
name_y (str, optional (default="")) – The name of variable y.
dtype_x (str, optional (default="numerical")) – The data type of variable x. Supported data type is “numerical” for continuous and ordinal variables.
dtype_y (str, optional (default="numerical")) – The data type of variable y. Supported data type is “numerical” for continuous and ordinal variables.
splits_x (numpy.ndarray) – List of split points for variable x.
splits_y (numpy.ndarray) – List of split points for variable y.
m (int) – Number of rows of the 2D array.
n (int) – Number of columns of the 2D array.
n_nonevent (numpy.ndarray) – Number of non-events.
n_event (numpy.ndarray) – Number of events.
D (numpy.ndarray) – Event rate 2D array.
P (numpy.ndarray) – Bin indices 2D array.
categories_x (numpy.ndarray or None (default=None)) – List of categories in variable x.
categories_y (numpy.ndarray or None (default=None)) – List of categories in variable y.
Warning
This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table, available in all optimal binning classes.
-
analysis(pvalue_test='chi2', n_samples=100, print_output=True)
Binning table analysis.
Statistical analysis of the binning table, computing the Gini index, the Information Value (IV), the Jensen-Shannon divergence, and the quality score. Additionally, several statistical significance tests between consecutive bins of the contingency table are performed: a frequentist test using the Chi-square test or Fisher's exact test, and a Bayesian A/B test using the beta distribution as a conjugate prior of the Bernoulli distribution.
- Parameters
pvalue_test (str, optional (default="chi2")) – The statistical test. Supported tests are “chi2” to choose the Chi-square test and “fisher” to choose Fisher's exact test.
n_samples (int, optional (default=100)) – The number of samples to run the Bayesian A/B testing between consecutive bins to compute the probability of the event rate of bin A being greater than the event rate of bin B.
print_output (bool (default=True)) – Whether to print analysis information.
Notes
The Chi-square test uses scipy.stats.chi2_contingency, and the Fisher exact test uses scipy.stats.fisher_exact.
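The Bayesian A/B test between two consecutive bins can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the function name, the uniform Beta(1, 1) prior, the fixed seed, and the counts are all assumptions made for the example.

```python
import random


def prob_a_greater_b(event_a, nonevent_a, event_b, nonevent_b,
                     n_samples=100, seed=0):
    """Monte Carlo estimate of P(event rate of A > event rate of B)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        # Beta(1 + events, 1 + non-events) is the posterior of a Bernoulli
        # rate under a uniform Beta(1, 1) prior (assumed here).
        rate_a = rng.betavariate(1 + event_a, 1 + nonevent_a)
        rate_b = rng.betavariate(1 + event_b, 1 + nonevent_b)
        wins += rate_a > rate_b
    return wins / n_samples


# Two consecutive bins with hypothetical event / non-event counts.
p = prob_a_greater_b(event_a=80, nonevent_a=120, event_b=30, nonevent_b=170)
print(f"P(event rate A > event rate B) ~ {p:.2f}")
```

A larger n_samples reduces the Monte Carlo noise of the estimate at the cost of more draws.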
-
build(show_digits=2, show_bin_xy=False, add_totals=True)
Build the binning table.
- Parameters
show_digits (int, optional (default=2)) – The number of significant digits of the bin column.
show_bin_xy (bool (default=False)) – Whether to show a single bin column with x and y.
add_totals (bool (default=True)) – Whether to add a last row with totals.
- Returns
binning_table
- Return type
pandas.DataFrame
-
property gini
The Gini coefficient or Accuracy Ratio.
The Gini coefficient is a quantitative measure of the discriminatory and predictive power of a variable. The Gini coefficient ranges from 0 to 1.
- Returns
gini
- Return type
float
-
property hellinger
The Hellinger divergence.
- Returns
hellinger
- Return type
float
-
property iv
The Information Value (IV) or Jeffreys' divergence measure.
The IV ranges from 0 to Infinity.
- Returns
iv
- Return type
float
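As an illustration of how the IV relates to the per-bin counts, the sketch below computes the Weight of Evidence and the IV from event and non-event counts. The function name and the counts are hypothetical, the WoE sign convention (log of non-event share over event share) is an assumption, and every bin is assumed to contain at least one event and one non-event.

```python
import math


def woe_and_iv(n_nonevent, n_event):
    """Return the per-bin WoE list and the total Information Value."""
    total_nonevent = sum(n_nonevent)
    total_event = sum(n_event)
    woe, iv = [], 0.0
    for ne, e in zip(n_nonevent, n_event):
        p_nonevent = ne / total_nonevent   # share of all non-events in bin
        p_event = e / total_event          # share of all events in bin
        w = math.log(p_nonevent / p_event)
        woe.append(w)
        # Jeffreys' divergence term: symmetric in the two distributions.
        iv += (p_nonevent - p_event) * w
    return woe, iv


woe, iv = woe_and_iv(n_nonevent=[400, 300, 100], n_event=[20, 50, 80])
print([round(w, 3) for w in woe], round(iv, 3))
```

Each term of the sum is non-negative, so the IV is zero only when both distributions coincide and grows without bound as they separate.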
-
property js
The Jensen-Shannon divergence measure (JS).
The JS ranges from 0 to \(\log(2)\).
- Returns
js
- Return type
float
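The upper bound of \(\log(2)\) can be checked with a small sketch of the textbook Jensen-Shannon divergence between two discrete distributions, here standing in for the event and non-event distributions across bins. The function name and inputs are illustrative assumptions, and the library's own computation may differ in detail.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between distributions p and q."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute zero.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    # Mixture distribution; positive wherever p or q is positive.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])   # lower bound
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])    # upper bound
print(identical, disjoint, math.log(2))
```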
-
property ks
The Kolmogorov-Smirnov statistic.
- Returns
ks
- Return type
float
-
plot(metric='woe', savefig=None)
Plot the binning table.
Visualize the Weight of Evidence or the event rate for each bin as a matrix, and the x and y trend.
- Parameters
metric (str, optional (default="woe")) – Supported metrics are “woe” to show the Weight of Evidence (WoE) measure and “event_rate” to show the event rate.
savefig (str or None (default=None)) – Path to save the plot figure.
-
property quality_score
The quality score (QS).
The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.
- Returns
quality_score
- Return type
float
-
property triangular
The triangular divergence.
- Returns
triangular
- Return type
float
Binning table 2D: continuous target
-
class optbinning.binning.multidimensional.binning_statistics_2d.ContinuousBinningTable2D(name_x, name_y, dtype_x, dtype_y, splits_x, splits_y, m, n, n_records, sums, stds, D, P, categories_x=None, categories_y=None)
Bases: optbinning.binning.binning_statistics.ContinuousBinningTable
Binning table to summarize optimal binning of two numerical variables with respect to a continuous target.
- Parameters
name_x (str, optional (default="")) – The name of variable x.
name_y (str, optional (default="")) – The name of variable y.
dtype_x (str, optional (default="numerical")) – The data type of variable x. Supported data type is “numerical” for continuous and ordinal variables.
dtype_y (str, optional (default="numerical")) – The data type of variable y. Supported data type is “numerical” for continuous and ordinal variables.
splits_x (numpy.ndarray) – List of split points for variable x.
splits_y (numpy.ndarray) – List of split points for variable y.
m (int) – Number of rows of the 2D array.
n (int) – Number of columns of the 2D array.
n_records (numpy.ndarray) – Number of records.
sums (numpy.ndarray) – Target sums.
stds (numpy.ndarray) – Target standard deviations.
D (numpy.ndarray) – Mean 2D array.
P (numpy.ndarray) – Bin indices 2D array.
Warning
This class is not intended to be instantiated by the user. It is preferable to use the class returned by the property binning_table, available in all optimal binning classes.
-
analysis(print_output=True)
Binning table analysis.
Statistical analysis of the binning table, computing the Information Value (IV) and Herfindahl-Hirschman Index (HHI).
- Parameters
print_output (bool (default=True)) – Whether to print analysis information.
Notes
The IV for a continuous target is computed as follows:
\[IV = \sum_{i=1}^n |U_i - \mu| \frac{r_i}{r_T},\]
where \(U_i\) is the target mean value for each bin, \(\mu\) is the total target mean, \(r_i\) is the number of records for each bin, and \(r_T\) is the total number of records.
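The formula above can be worked through on a small example. The sketch below is a direct transcription of the formula, not the library's code; the function name and the bin values are hypothetical.

```python
def continuous_iv(means, n_records):
    """IV for a continuous target: record-weighted mean absolute deviation."""
    r_total = sum(n_records)
    # Total target mean, recovered as the record-weighted mean of bin means.
    mu = sum(u * r for u, r in zip(means, n_records)) / r_total
    return sum(abs(u - mu) * r / r_total for u, r in zip(means, n_records))


# Three bins with hypothetical target means U_i and record counts r_i.
iv = continuous_iv(means=[10.0, 20.0, 40.0], n_records=[50, 30, 20])
print(iv)
```

Here \(\mu = 19\), so the three terms are \(9 \cdot 0.5\), \(1 \cdot 0.3\), and \(21 \cdot 0.2\).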
-
build(show_digits=2, show_bin_xy=False, add_totals=True)
Build the binning table.
- Parameters
show_digits (int, optional (default=2)) – The number of significant digits of the bin column.
show_bin_xy (bool (default=False)) – Whether to show a single bin column with x and y.
add_totals (bool (default=True)) – Whether to add a last row with totals.
- Returns
binning_table
- Return type
pandas.DataFrame
-
property iv
The Information Value (IV).
The IV ranges from 0 to Infinity.
- Returns
iv
- Return type
float
-
plot(savefig=None)
Plot the binning table.
Visualize the mean for each bin as a matrix, and the x and y trend.
- Parameters
savefig (str or None (default=None)) – Path to save the plot figure.
-
property quality_score
The quality score (QS).
The QS is a rating of the quality and discriminatory power of a variable. The QS ranges from 0 to 1.
- Returns
quality_score
- Return type
float
-
property woe
The sum of absolute WoEs.
This metric is computed as follows:
\[WoE = \sum_{i=1}^n |U_i - \mu|,\]
where \(U_i\) is the target mean value for each bin and \(\mu\) is the total target mean.
- Returns
woe
- Return type
float
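A minimal sketch of this metric, starting from per-bin target sums and record counts like the sums and n_records arrays the constructor receives. The function name and the values are hypothetical, and this is a transcription of the formula rather than the library's implementation.

```python
def continuous_woe(sums, n_records):
    """Sum of absolute deviations of the bin means from the total mean."""
    mu = sum(sums) / sum(n_records)                    # total target mean
    means = [s / r for s, r in zip(sums, n_records)]   # U_i for each bin
    return sum(abs(u - mu) for u in means)


# Hypothetical per-bin target sums and record counts.
woe = continuous_woe(sums=[500.0, 600.0, 800.0], n_records=[50, 30, 20])
print(woe)
```

Unlike the IV for a continuous target, the deviations here are not weighted by bin size, so a small bin with an extreme mean counts as much as a large one.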