Utilities¶
Pre-binning¶
class optbinning.binning.prebinning.PreBinning(problem_type, method, n_bins, min_bin_size, class_weight=None, **kwargs)¶
Bases: object
Prebinning algorithms.
- Parameters
problem_type – The problem type depending on the target type.
method (str) – Available methods are ‘uniform’, ‘quantile’ and ‘cart’.
n_bins (int) – The number of bins to produce.
min_bin_size (int, float) – The minimum bin size.
**kwargs (keyword arguments) – Keyword arguments for the prebinning method. See notes.
Notes
Keyword arguments are those available in the following classes:
fit(x, y, sample_weight=None)¶
Fit the PreBinning algorithm.
- Parameters
x (array-like, shape = (n_samples)) – Data samples, where n_samples is the number of samples.
y (array-like, shape = (n_samples)) – Target vector relative to x.
sample_weight (array-like of shape (n_samples,) (default=None)) – Array of weights that are assigned to individual samples.
- Returns
self
- Return type
PreBinning
property splits¶
List of split points.
- Returns
splits
- Return type
numpy.ndarray
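A minimal usage sketch of PreBinning on a binary target. The value problem_type="classification" is an assumption here (matching the library's internal usage for binary targets, not stated in the signature above), and scikit-learn is used only to supply sample data; x may be one-dimensional, as the shape (n_samples) in fit suggests:

```python
from sklearn.datasets import load_breast_cancer
from optbinning.binning.prebinning import PreBinning

data = load_breast_cancer()
x = data.data[:, 5]  # a single numerical feature
y = data.target

# method="cart" prebins by fitting a decision tree and taking its split
# thresholds; min_bin_size=0.05 asks for at least 5% of samples per bin.
# problem_type="classification" is assumed, not documented above.
prebinning = PreBinning(problem_type="classification", method="cart",
                        n_bins=10, min_bin_size=0.05)
prebinning.fit(x, y)

print(prebinning.splits)  # numpy.ndarray of split points
```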
Transformations¶
The Weight of Evidence \(\text{WoE}_i\) and event rate \(D_i\) for each bin are related by means of the functional equations

\[\text{WoE}_i = \log\left(\frac{1 - D_i}{D_i}\right) + \log\left(\frac{N_T^{E}}{N_T^{NE}}\right), \qquad D_i = \frac{1}{1 + e^{\text{WoE}_i - \log(N_T^{E}/N_T^{NE})}},\]

where \(D_i\) can be characterized as a logistic function of \(\text{WoE}_i\), and \(\text{WoE}_i\) can be expressed in terms of the logit function of \(D_i\). The constant term \(\log(N_T^{E} / N_T^{NE})\) is the log ratio of the total number of events \(N_T^{E}\) to the total number of non-events \(N_T^{NE}\). This shows that WoE is inversely related to the event rate.
optbinning.binning.transformations.transform_event_rate_to_woe(event_rate, n_nonevent, n_event)¶
Transform event rate to WoE.
- Parameters
event_rate (array-like or float) – Event rate.
n_nonevent (int) – Total number of non-events.
n_event (int) – Total number of events.
- Returns
woe – Weight of evidence.
- Return type
numpy.ndarray or float
optbinning.binning.transformations.transform_woe_to_event_rate(woe, n_nonevent, n_event)¶
Transform WoE to event rate.
- Parameters
woe (array-like or float) – Weight of evidence.
n_nonevent (int) – Total number of non-events.
n_event (int) – Total number of events.
- Returns
event_rate – Event rate.
- Return type
numpy.ndarray or float
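As a quick check of the relations above, the two transformations should be inverses of each other. A minimal sketch with hypothetical portfolio totals, assuming the functions implement the functional equations as written:

```python
import numpy as np
from optbinning.binning.transformations import (
    transform_event_rate_to_woe, transform_woe_to_event_rate)

# Hypothetical totals: 400 events and 9600 non-events.
n_event, n_nonevent = 400, 9600
event_rate = np.array([0.01, 0.04, 0.10, 0.25])

woe = transform_event_rate_to_woe(event_rate, n_nonevent, n_event)
roundtrip = transform_woe_to_event_rate(woe, n_nonevent, n_event)

# WoE decreases as the event rate increases, and the round trip
# recovers the original event rates.
assert np.all(np.diff(woe) < 0)
assert np.allclose(roundtrip, event_rate)
```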
Metrics¶
Gini coefficient¶
The Gini coefficient or Accuracy Ratio is a quantitative measure of discriminatory and predictive power given a distribution of events and non-events. The Gini coefficient ranges from 0 to 1 and, for bins sorted in descending order of event rate, is defined by

\[G = 1 - \frac{\sum_{i=1}^{n} N_i^{E}\left(N_i^{NE} + 2\sum_{j=1}^{i-1} N_j^{NE}\right)}{N_T^{E} N_T^{NE}},\]

where \(N_i^{E}\) and \(N_i^{NE}\) are the number of events and non-events per bin, respectively, and \(N_T^{E}\) and \(N_T^{NE}\) are the total number of events and non-events, respectively.
optbinning.binning.metrics.gini(event, nonevent)¶
Calculate the Gini coefficient given the number of events and non-events.
- Parameters
event (array-like) – Number of events.
nonevent (array-like) – Number of non-events.
- Returns
gini
- Return type
float
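A small sketch with hypothetical per-bin counts; per the definition above, concentrating events in a few bins pushes the Gini coefficient toward 1:

```python
from optbinning.binning.metrics import gini

# Hypothetical counts per bin, ordered by event rate.
event = [10, 40, 90, 160]
nonevent = [990, 760, 510, 240]

print(gini(event, nonevent))  # a float in [0, 1]
```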
Divergence measures¶
Given two discrete probability distributions \(P = (p_1, \ldots, p_n)\) and \(Q = (q_1, \ldots, q_n)\), the Shannon entropy is defined as

\[H(P) = -\sum_{i=1}^{n} p_i \log(p_i).\]

The Kullback-Leibler divergence, denoted as \(D_{KL}(P||Q)\), is given by

\[D_{KL}(P||Q) = \sum_{i=1}^{n} p_i \log\left(\frac{p_i}{q_i}\right).\]

The Jeffrey's divergence or Information Value (IV) is a symmetric measure expressible in terms of the Kullback-Leibler divergence, defined by

\[J(P||Q) = D_{KL}(P||Q) + D_{KL}(Q||P) = \sum_{i=1}^{n} (p_i - q_i) \log\left(\frac{p_i}{q_i}\right).\]

The Jensen-Shannon divergence is a bounded symmetric measure also expressible in terms of the Kullback-Leibler divergence,

\[JSD(P||Q) = \frac{1}{2}\left(D_{KL}(P||M) + D_{KL}(Q||M)\right), \qquad M = \frac{1}{2}(P + Q),\]

and bounded by \(JSD(P||Q) \in [0, \log(2)]\). We note that these measures cannot be directly used whenever \(p_i = 0\) and/or \(q_i = 0\).
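The identities above can be verified numerically. A short sketch using plain NumPy, with strictly positive distributions per the caveat on \(p_i = 0\):

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

def kl(a, b):
    # Kullback-Leibler divergence D_KL(a||b) for positive distributions.
    return np.sum(a * np.log(a / b))

m = 0.5 * (p + q)
jeffrey_div = kl(p, q) + kl(q, p)
jsd = 0.5 * (kl(p, m) + kl(q, m))

# Jeffrey's divergence equals its symmetrized sum form, and the
# Jensen-Shannon divergence is bounded by log(2).
assert np.isclose(jeffrey_div, np.sum((p - q) * np.log(p / q)))
assert 0.0 <= jsd <= np.log(2)
```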
optbinning.binning.metrics.entropy(x)¶
Calculate the entropy of a discrete probability distribution.
- Parameters
x (array-like) – Discrete probability distribution.
- Returns
entropy
- Return type
float
optbinning.binning.metrics.kullback_leibler(x, y, return_sum=False)¶
Calculate the Kullback-Leibler divergence between two distributions.
- Parameters
x (array-like) – Discrete probability distribution.
y (array-like) – Discrete probability distribution.
return_sum (bool) – Return the sum of the Kullback-Leibler values.
- Returns
kullback_leibler
- Return type
float or numpy.ndarray
optbinning.binning.metrics.jeffrey(x, y, return_sum=False)¶
Calculate the Jeffrey's divergence between two distributions.
- Parameters
x (array-like) – Discrete probability distribution.
y (array-like) – Discrete probability distribution.
return_sum (bool) – Return the sum of the Jeffrey's divergence values.
- Returns
jeffrey
- Return type
float or numpy.ndarray
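A usage sketch of the three metrics documented above, assuming return_sum=True yields the scalar total and return_sum=False the per-element contributions, as the return types suggest:

```python
import numpy as np
from optbinning.binning.metrics import entropy, kullback_leibler, jeffrey

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

print(entropy(p))                               # H(P) as a float
print(kullback_leibler(p, q, return_sum=True))  # D_KL(P||Q) as a float
print(jeffrey(p, q, return_sum=False))          # per-element array

# Jeffrey's divergence is symmetric, unlike the KL divergence.
assert np.isclose(jeffrey(p, q, return_sum=True),
                  kullback_leibler(p, q, return_sum=True)
                  + kullback_leibler(q, p, return_sum=True))
```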
optbinning.binning.metrics.jensen_shannon(x, y, return_sum=False)¶
Calculate the Jensen-Shannon divergence between two distributions.
- Parameters
x (array-like) – Discrete probability distribution.
y (array-like) – Discrete probability distribution.
return_sum (bool) – Return the sum of the Jensen-Shannon values.
- Returns
jensen_shannon
- Return type
float or numpy.ndarray
optbinning.binning.metrics.jensen_shannon_multivariate(X, weights=None)¶
Calculate the Jensen-Shannon divergence between several distributions.
- Parameters
X (array-like, shape = (n_samples, n_distributions)) – Discrete probability distributions.
weights (array-like, shape = (n_distributions)) – Array of weights associated with the distributions. If None, all distributions are assumed to have equal weight.
- Returns
jensen_shannon
- Return type
float
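A sketch relating the two Jensen-Shannon functions. With two equally weighted distributions, the multivariate version should reduce to the pairwise divergence; this reduction follows from the definition above but is an assumption about the implementation:

```python
import numpy as np
from optbinning.binning.metrics import (jensen_shannon,
                                        jensen_shannon_multivariate)

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

jsd = jensen_shannon(p, q, return_sum=True)
assert 0.0 <= jsd <= np.log(2)  # bounded, as stated above

# Columns of X are the distributions; weights=None means equal weights.
X = np.column_stack([p, q])
assert np.isclose(jensen_shannon_multivariate(X), jsd)
```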
optbinning.binning.metrics.hellinger(x, y, return_sum=False)¶
Calculate the Hellinger discrimination between two distributions.
- Parameters
x (array-like) – Discrete probability distribution.
y (array-like) – Discrete probability distribution.
return_sum (bool) – Return the sum of the Hellinger values.
- Returns
hellinger
- Return type
float or numpy.ndarray
optbinning.binning.metrics.triangular(x, y, return_sum=False)¶
Calculate the Le Cam or triangular discrimination between two distributions.
- Parameters
x (array-like) – Discrete probability distribution.
y (array-like) – Discrete probability distribution.
return_sum (bool) – Return the sum of the triangular discrimination values.
- Returns
triangular
- Return type
float or numpy.ndarray
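A usage sketch for these two divergences, following the same return_sum conventions as above:

```python
import numpy as np
from optbinning.binning.metrics import hellinger, triangular

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Per-bin contributions and scalar totals.
print(hellinger(p, q, return_sum=False))
print(hellinger(p, q, return_sum=True))
print(triangular(p, q, return_sum=True))
```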