Utilities¶

Pre-binning¶

class optbinning.binning.prebinning.PreBinning(problem_type, method, n_bins, min_bin_size, class_weight=None, **kwargs)¶

Bases: object

Prebinning algorithms.

Parameters

problem_type – The problem type depending on the target type.
method (str) – Available methods are ‘uniform’, ‘quantile’, ‘cart’ and ‘mdlp’.
n_bins (int) – The number of bins to produce.
min_bin_size (int, float) – The minimum bin size.
**kwargs (keyword arguments) – Keyword arguments for prebinning method. See notes.

Notes

Keyword arguments are those available in the following classes:

method="uniform": sklearn.preprocessing.KBinsDiscretizer.

method="quantile": sklearn.preprocessing.KBinsDiscretizer.

method="cart": sklearn.tree.DecisionTreeClassifier.

method="mdlp": optbinning.binning.mdlp.MDLP.

fit(x, y, sample_weight=None)¶

Fit PreBinning algorithm.

Parameters

x (array-like, shape = (n_samples,)) – Data samples, where n_samples is the number of samples.
y (array-like, shape = (n_samples,)) – Target vector relative to x.
sample_weight (array-like of shape (n_samples,) (default=None)) – Array of weights that are assigned to individual samples.

Returns

self

Return type

PreBinning

property splits¶

List of split points

Returns: splits
Return type: numpy.ndarray
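
To make the idea of split points concrete, the following minimal sketch computes the interior bin edges that equal-width ("uniform") and equal-frequency ("quantile") discretization would yield on a small sample. It uses only the Python standard library and does not call PreBinning. The helper names `uniform_splits` and `quantile_splits` are hypothetical, and treating the splits as the interior bin edges is an assumption of this illustration.

```python
from statistics import quantiles


def uniform_splits(x, n_bins):
    """Interior edges of n_bins equal-width bins spanning [min(x), max(x)]."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(1, n_bins)]


def quantile_splits(x, n_bins):
    """Interior edges that place roughly the same number of samples per bin.

    method="inclusive" interpolates linearly between the sample minimum
    and maximum.
    """
    return quantiles(x, n=n_bins, method="inclusive")


# Hypothetical sample with one large outlier
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print(uniform_splits(x, 4))   # [25.75, 50.5, 75.25]
print(quantile_splits(x, 4))  # [3.25, 5.5, 7.75]
```

The outlier stretches the equal-width edges so that almost all samples fall in the first bin, whereas the quantile edges stay within the bulk of the data.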

Transformations¶

The Weight of Evidence \(\text{WoE}_i\) and event rate \(D_i\) for each bin are related by means of the functional equations

\[\begin{split}\begin{align} \text{WoE}_i &= \log\left(\frac{1 - D_i}{D_i}\right) + \log\left(\frac{N_T^{E}}{N_T^{NE}}\right) = \log\left(\frac{N_T^{E}}{N_T^{NE}}\right) - \text{logit}(D_i)\\ D_i &= \left(1 + \frac{N_T^{NE}}{N_T^{E}} e^{\text{WoE}_i}\right)^{-1} = \left(1 + e^{\text{WoE}_i - \log\left(\frac{N_T^{E}}{N_T^{NE}}\right)}\right)^{-1}, \end{align}\end{split}\]

where \(D_i\) can be characterized as a logistic function of \(\text{WoE}_i\), and \(\text{WoE}_i\) can be expressed in terms of the logit function of \(D_i\). The constant term \(\log(N_T^{E} / N_T^{NE})\) is the log ratio of the total number of events \(N_T^{E}\) to the total number of non-events \(N_T^{NE}\). Since the logit function is increasing in \(D_i\), WoE decreases as the event rate increases.
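
The two equations can be checked numerically with a short sketch. The functions below are hypothetical, scalar-only re-implementations of the formulas written for illustration, not the library functions. Their argument order follows the signatures documented in this section, and the counts are made-up values.

```python
import math


def event_rate_to_woe(event_rate, n_nonevent, n_event):
    """WoE_i = log((1 - D_i) / D_i) + log(N_T^E / N_T^NE)."""
    return math.log((1 - event_rate) / event_rate) + math.log(n_event / n_nonevent)


def woe_to_event_rate(woe, n_nonevent, n_event):
    """D_i = 1 / (1 + (N_T^NE / N_T^E) * exp(WoE_i))."""
    return 1.0 / (1.0 + (n_nonevent / n_event) * math.exp(woe))


# Hypothetical totals: 900 non-events and 100 events
n_nonevent, n_event = 900, 100
overall_rate = n_event / (n_event + n_nonevent)

# A bin whose event rate equals the overall event rate has WoE = 0
woe_average = event_rate_to_woe(overall_rate, n_nonevent, n_event)

# A bin with a higher event rate than the overall rate has negative WoE
woe_risky = event_rate_to_woe(0.3, n_nonevent, n_event)

# Applying the inverse transformation recovers the event rate
recovered = woe_to_event_rate(woe_risky, n_nonevent, n_event)

print(woe_average, woe_risky, recovered)
```

A bin with the overall event rate gets a WoE of zero, so the sign of the WoE indicates whether a bin is below or above the average event rate.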

optbinning.binning.transformations.transform_event_rate_to_woe(event_rate, n_nonevent, n_event)¶

Transform event rate to WoE.

Parameters

event_rate (array-like or float) – Event rate.
n_nonevent (int) – Total number of non-events.
n_event (int) – Total number of events.

Returns

woe – Weight of evidence.

Return type

numpy.ndarray or float

optbinning.binning.transformations.transform_woe_to_event_rate(woe, n_nonevent, n_event)¶

Transform WoE to event rate.

Parameters

woe (array-like or float) – Weight of evidence.
n_nonevent (int) – Total number of non-events.
n_event (int) – Total number of events.

Returns

event_rate – Event rate.

Return type

numpy.ndarray or float

Metrics¶

Gini coefficient¶

The Gini coefficient or Accuracy Ratio is a quantitative measure of discriminatory and predictive power given a distribution of events and non-events. The Gini coefficient ranges from 0 to 1, and is defined by

\[Gini = 1 - \frac{2 \sum_{i=2}^n \left(N_i^{E} \sum_{j=1}^{i-1} N_j^{NE}\right) + \sum_{k=1}^n N_k^{E} N_k^{NE}}{N_T^{E} N_T^{NE}},\]

where \(N_i^{E}\) and \(N_i^{NE}\) are the number of events and non-events per bin, respectively, and \(N_T^{E}\) and \(N_T^{NE}\) are the total number of events and non-events, respectively.
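
As a sketch, the formula can be evaluated directly from per-bin counts. The function `gini_from_counts` is a hypothetical helper written for this illustration, not the library implementation. The examples list bins from highest to lowest event rate, which is an assumption of this sketch, because the value of the expression depends on the bin order.

```python
def gini_from_counts(event, nonevent):
    """Evaluate the Gini formula literally from per-bin counts.

    Bins are used in the order given; no sorting is applied.
    """
    n_event_total = sum(event)
    n_nonevent_total = sum(nonevent)

    cross = 0
    cum_nonevent = 0  # running sum of N_j^NE over the bins before bin i
    for e_i, ne_i in zip(event, nonevent):
        cross += e_i * cum_nonevent
        cum_nonevent += ne_i

    diagonal = sum(e_k * ne_k for e_k, ne_k in zip(event, nonevent))
    return 1 - (2 * cross + diagonal) / (n_event_total * n_nonevent_total)


# Hypothetical counts for three bins, ordered by decreasing event rate
g = gini_from_counts([8, 4, 2], [2, 6, 12])

# Perfect separation of events and non-events
g_perfect = gini_from_counts([10, 0], [0, 10])

# Identical event rate in every bin: no discriminatory power
g_none = gini_from_counts([5, 5], [5, 5])

print(g, g_perfect, g_none)
```

For the three-bin example, the numerator is 2 · 24 + 64 = 112 and the denominator is 14 · 20 = 280, which gives 1 - 0.4 = 0.6.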

optbinning.binning.metrics.gini(event, nonevent)¶

Calculate the Gini coefficient given the number of events and non-events.

Parameters

event (array-like) – Number of events.
nonevent (array-like) – Number of non-events.

Returns

gini

Return type

float

Divergence measures¶

Given two discrete probability distributions \(P\) and \(Q\), the Shannon entropy of \(P\) is defined as

\[S(P) = - \sum_{i=1}^n p_i \log(p_i).\]

The Kullback-Leibler divergence, denoted as \(D_{KL}(P||Q)\), is given by

\[D_{KL}(P || Q) = \sum_{i=1}^n p_i \log \left(\frac{p_i}{q_i}\right).\]

The Jeffrey’s divergence, or Information Value (IV), is a symmetric measure expressible in terms of the Kullback-Leibler divergence and defined by

\[\begin{split}\begin{align*} J(P|| Q) &= D_{KL}(P || Q) + D_{KL}(Q || P) = \sum_{i=1}^n p_i \log \left(\frac{p_i}{q_i}\right) + \sum_{i=1}^n q_i \log \left(\frac{q_i}{p_i}\right)\\ &= \sum_{i=1}^n (p_i - q_i) \log \left(\frac{p_i}{q_i}\right). \end{align*}\end{split}\]

The Jensen-Shannon divergence is a bounded symmetric measure also expressible in terms of the Kullback-Leibler divergence

\[JSD(P || Q) = \frac{1}{2}\left(D_{KL}(P || M) + D_{KL}(Q || M)\right), \quad M = \frac{1}{2}(P + Q),\]

and bounded by \(JSD(P||Q) \in [0, \log(2)]\). We note that these measures cannot be directly used whenever \(p_i = 0\) and/or \(q_i = 0\).
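
The definitions above translate almost line by line into code. The following sketch uses hypothetical helper names and plain Python lists rather than the library functions, and assumes strictly positive probabilities so that every logarithm is defined. It also checks the symmetry of the Jeffrey’s divergence and the bound on the Jensen-Shannon divergence.

```python
import math


def shannon_entropy(p):
    """S(P) = -sum_i p_i log(p_i)."""
    return -sum(p_i * math.log(p_i) for p_i in p)


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i log(p_i / q_i)."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q))


def jeffrey_divergence(p, q):
    """J(P || Q) = sum_i (p_i - q_i) log(p_i / q_i)."""
    return sum((p_i - q_i) * math.log(p_i / q_i) for p_i, q_i in zip(p, q))


def js_divergence(p, q):
    """JSD(P || Q) = (D_KL(P || M) + D_KL(Q || M)) / 2 with M = (P + Q) / 2."""
    m = [0.5 * (p_i + q_i) for p_i, q_i in zip(p, q)]
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))


# Two hypothetical distributions over three bins
p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]

j = jeffrey_divergence(p, q)
jsd = js_divergence(p, q)

# Jeffrey's divergence is the sum of the two directed KL divergences
print(math.isclose(j, kl_divergence(p, q) + kl_divergence(q, p)))  # True
# It is symmetric, unlike the KL divergence itself
print(math.isclose(j, jeffrey_divergence(q, p)))                   # True
# The Jensen-Shannon divergence stays within [0, log(2)]
print(0.0 <= jsd <= math.log(2))                                   # True
```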

optbinning.binning.metrics.entropy(x)¶

Calculate the entropy of a discrete probability distribution.

Parameters: x (array-like) – Discrete probability distribution.
Returns: entropy
Return type: float

optbinning.binning.metrics.kullback_leibler(x, y, return_sum=False)¶

Calculate the Kullback-Leibler divergence between two distributions.

Parameters

x (array-like) – Discrete probability distribution.
y (array-like) – Discrete probability distribution.
return_sum (bool) – If True, return the sum of the element-wise Kullback-Leibler values.

Returns

kullback_leibler

Return type

float or numpy.ndarray

optbinning.binning.metrics.jeffrey(x, y, return_sum=False)¶

Calculate the Jeffrey’s divergence between two distributions.

Parameters

x (array-like) – Discrete probability distribution.
y (array-like) – Discrete probability distribution.
return_sum (bool) – If True, return the sum of the element-wise Jeffrey values.

Returns

jeffrey

Return type

float or numpy.ndarray

optbinning.binning.metrics.jensen_shannon(x, y, return_sum=False)¶

Calculate the Jensen-Shannon divergence between two distributions.

Parameters

x (array-like) – Discrete probability distribution.
y (array-like) – Discrete probability distribution.
return_sum (bool) – If True, return the sum of the element-wise Jensen-Shannon values.

Returns

jensen_shannon

Return type

float or numpy.ndarray

optbinning.binning.metrics.jensen_shannon_multivariate(X, weights=None)¶

Calculate Jensen-Shannon divergence between several distributions.

Parameters

X (array-like, shape = (n_samples, n_distributions)) – Discrete probability distributions.
weights (array-like, shape = (n_distributions,)) – Array of weights associated with the distributions. If None, all distributions are assumed to have equal weight.

Returns

jensen_shannon

Return type

float
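
This section does not state the formula for the multivariate case. A common generalization, assumed here, is the entropy of the weighted mixture minus the weighted sum of the individual entropies. The sketch below implements that definition with hypothetical helper names, taking each column of X as one distribution, and is not the library implementation.

```python
import math


def shannon_entropy(p):
    """S(P) = -sum_i p_i log(p_i)."""
    return -sum(p_i * math.log(p_i) for p_i in p)


def js_multivariate(X, weights=None):
    """Assumed definition: S(sum_k w_k P_k) - sum_k w_k S(P_k).

    X is a list of rows with shape (n_samples, n_distributions).
    """
    n_distributions = len(X[0])
    if weights is None:
        # Equal weight for every distribution
        weights = [1.0 / n_distributions] * n_distributions

    columns = list(zip(*X))  # one tuple per distribution
    mixture = [sum(w * x for w, x in zip(weights, row)) for row in X]
    weighted_entropy = sum(
        w * shannon_entropy(col) for w, col in zip(weights, columns)
    )
    return shannon_entropy(mixture) - weighted_entropy


# Two hypothetical distributions stored as the columns of X
X = [[0.1, 0.3],
     [0.4, 0.3],
     [0.5, 0.4]]
jsd_two = js_multivariate(X)

# Identical columns give zero divergence (up to rounding)
jsd_same = js_multivariate([[0.2, 0.2], [0.8, 0.8]])

print(jsd_two, jsd_same)
```

With two distributions and equal weights, this expression reduces to the two-distribution Jensen-Shannon divergence defined earlier.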

optbinning.binning.metrics.hellinger(x, y, return_sum=False)¶

Calculate the Hellinger discrimination between two distributions.

Parameters

x (array-like) – Discrete probability distribution.
y (array-like) – Discrete probability distribution.
return_sum (bool) – If True, return the sum of the element-wise Hellinger values.

Returns

hellinger

Return type

float or numpy.ndarray

optbinning.binning.metrics.triangular(x, y, return_sum=False)¶

Calculate the LeCam or triangular discrimination between two distributions.

Parameters

x (array-like) – Discrete probability distribution.
y (array-like) – Discrete probability distribution.
return_sum (bool) – If True, return the sum of the element-wise triangular values.

Returns

triangular

Return type

float or numpy.ndarray
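
The formulas for the Hellinger and triangular discriminations are not given in this section. The sketch below assumes the commonly used definitions \(h(P, Q) = \frac{1}{2}\sum_i (\sqrt{p_i} - \sqrt{q_i})^2\) and \(\Delta(P, Q) = \sum_i (p_i - q_i)^2 / (p_i + q_i)\). The helper names are hypothetical, and the library may use a different scaling.

```python
import math


def hellinger_discrimination(p, q):
    """Assumed definition: (1/2) * sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    return 0.5 * sum(
        (math.sqrt(p_i) - math.sqrt(q_i)) ** 2 for p_i, q_i in zip(p, q)
    )


def triangular_discrimination(p, q):
    """Assumed definition: sum_i (p_i - q_i)^2 / (p_i + q_i)."""
    return sum((p_i - q_i) ** 2 / (p_i + q_i) for p_i, q_i in zip(p, q))


# Two hypothetical distributions over three bins
p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]

h = hellinger_discrimination(p, q)
t = triangular_discrimination(p, q)

# Distributions with disjoint support reach the upper bounds 1 and 2
h_max = hellinger_discrimination([1.0, 0.0], [0.0, 1.0])
t_max = triangular_discrimination([1.0, 0.0], [0.0, 1.0])

print(h, t, h_max, t_max)
```

Under these definitions, the Hellinger term stays finite when \(p_i = 0\) or \(q_i = 0\), and the triangular term is undefined only when both are zero.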