# Utilities¶

## Pre-binning¶

class optbinning.binning.prebinning.PreBinning(problem_type, method, n_bins, min_bin_size, class_weight=None, **kwargs)

Bases: object

Prebinning algorithms.

Parameters
• problem_type – The problem type depending on the target type.

• method (str) – Available methods are ‘uniform’, ‘quantile’, ‘cart’ and ‘mdlp’.

• n_bins (int) – The number of bins to produce.

• min_bin_size (int, float) – The minimum bin size.

• class_weight (dict, ‘balanced’ or None, optional (default=None)) – Weights associated with classes; passed to the underlying estimator.

• **kwargs (keyword arguments) – Keyword arguments for the prebinning method. See Notes.

Notes

Keyword arguments are those available in the following classes:

• method="uniform": sklearn.preprocessing.KBinsDiscretizer.

• method="quantile": sklearn.preprocessing.KBinsDiscretizer.

• method="cart": sklearn.tree.DecisionTreeClassifier.

• method="mdlp": optbinning.binning.mdlp.MDLP.

fit(x, y, sample_weight=None)

Fit PreBinning algorithm.

Parameters
• x (array-like, shape = (n_samples)) – Data samples, where n_samples is the number of samples.

• y (array-like, shape = (n_samples)) – Target vector relative to x.

• sample_weight (array-like of shape (n_samples,) (default=None)) – Array of weights that are assigned to individual samples.

Returns

self

Return type

PreBinning

property splits

List of split points

Returns

splits

Return type

numpy.ndarray
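
To illustrate the kind of split points method="quantile" produces, here is a minimal numpy sketch; it is not the actual implementation (which delegates to sklearn.preprocessing.KBinsDiscretizer), and quantile_splits is a hypothetical helper name:

```python
import numpy as np

def quantile_splits(x, n_bins):
    """Approximate the splits of quantile prebinning: the interior
    quantile edges of the data (the two outer edges are dropped)."""
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.unique(np.quantile(x, quantiles))

x = np.arange(100, dtype=float)
splits = quantile_splits(x, n_bins=4)  # three interior split points
```

With n_bins bins, the property ``splits`` holds at most n_bins - 1 interior cut points; duplicates arising from ties are collapsed, which is why fewer splits than requested bins may be returned.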

## Transformations¶

The Weight of Evidence $$\text{WoE}_i$$ and event rate $$D_i$$ for each bin are related by means of the functional equations

\begin{split}\begin{align} \text{WoE}_i &= \log\left(\frac{1 - D_i}{D_i}\right) + \log\left(\frac{N_T^{E}}{N_T^{NE}}\right) = \log\left(\frac{N_T^{E}}{N_T^{NE}}\right) - \text{logit}(D_i)\\ D_i &= \left(1 + \frac{N_T^{NE}}{N_T^{E}} e^{\text{WoE}_i}\right)^{-1} = \left(1 + e^{\text{WoE}_i - \log\left(\frac{N_T^{E}}{N_T^{NE}}\right)}\right)^{-1}, \end{align}\end{split}

where $$D_i$$ can be characterized as a logistic function of $$\text{WoE}_i$$, and $$\text{WoE}_i$$ can be expressed in terms of the logit function of $$D_i$$. The constant term $$\log(N_T^{E} / N_T^{NE})$$ is the log ratio of the total number of events $$N_T^{E}$$ and the total number of non-events $$N_T^{NE}$$. This shows that WoE is inversely related to the event rate.
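
A quick numerical check of these functional equations, written directly from the formulas above (a numpy sketch with hypothetical bin totals, not the library implementation):

```python
import numpy as np

n_event, n_nonevent = 400, 600  # hypothetical totals N_T^E, N_T^NE
const = np.log(n_event / n_nonevent)  # log(N_T^E / N_T^NE)

def event_rate_to_woe(d):
    # WoE_i = log((1 - D_i) / D_i) + log(N_T^E / N_T^NE)
    return np.log((1 - d) / d) + const

def woe_to_event_rate(woe):
    # D_i = (1 + (N_T^NE / N_T^E) * exp(WoE_i))^{-1}
    return 1.0 / (1.0 + (n_nonevent / n_event) * np.exp(woe))

rates = np.array([0.1, 0.4, 0.7])
woe = event_rate_to_woe(rates)
```

Applying one transform after the other recovers the original event rates, and the WoE values decrease as the event rate increases, confirming the inverse relationship.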

optbinning.binning.transformations.transform_event_rate_to_woe(event_rate, n_nonevent, n_event)

Transform event rate to WoE.

Parameters
• event_rate (array-like or float) – Event rate.

• n_nonevent (int) – Total number of non-events.

• n_event (int) – Total number of events.

Returns

woe – Weight of evidence.

Return type

numpy.ndarray or float

optbinning.binning.transformations.transform_woe_to_event_rate(woe, n_nonevent, n_event)

Transform WoE to event rate.

Parameters
• woe (array-like or float) – Weight of evidence.

• n_nonevent (int) – Total number of non-events.

• n_event (int) – Total number of events.

Returns

event_rate – Event rate.

Return type

numpy.ndarray or float

## Metrics¶

### Gini coefficient¶

The Gini coefficient or Accuracy Ratio is a quantitative measure of discriminatory and predictive power given a distribution of events and non-events. The Gini coefficient ranges from 0 to 1, and is defined by

$Gini = 1 - \frac{2 \sum_{i=2}^n \left(N_i^{E} \sum_{j=1}^{i-1} N_j^{NE}\right) + \sum_{k=1}^n N_k^{E} N_k^{NE}}{N_T^{E} N_T^{NE}},$

where $$N_i^{E}$$ and $$N_i^{NE}$$ are the number of events and non-events per bin, respectively, and $$N_T^{E}$$ and $$N_T^{NE}$$ are the total number of events and non-events, respectively.
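
The formula translates directly to numpy. The following is a sketch of the computation as written above (not the library's implementation, which may order bins internally); it assumes the bins are sorted by decreasing event rate, so that perfect discrimination yields a Gini of 1:

```python
import numpy as np

def gini_from_counts(event, nonevent):
    """Gini coefficient from per-bin event/non-event counts."""
    event = np.asarray(event, dtype=float)
    nonevent = np.asarray(nonevent, dtype=float)
    # sum_{i=2}^n N_i^E * sum_{j=1}^{i-1} N_j^NE via a cumulative sum
    cum_nonevent = np.cumsum(nonevent)
    cross = np.sum(event[1:] * cum_nonevent[:-1])
    # tie term: sum_{k=1}^n N_k^E N_k^NE
    ties = np.sum(event * nonevent)
    total = event.sum() * nonevent.sum()
    return 1.0 - (2.0 * cross + ties) / total
```

For example, bins that perfectly separate events from non-events give a Gini of 1, while bins with identical event rates give 0.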

optbinning.binning.metrics.gini(event, nonevent)

Calculate the Gini coefficient given the number of events and non-events.

Parameters
• event (array-like) – Number of events.

• nonevent (array-like) – Number of non-events.

Returns

gini

Return type

float

### Divergence measures¶

Given two discrete probability distributions $$P$$ and $$Q$$, the Shannon entropy is defined as

$S(P) = - \sum_{i=1}^n p_i \log(p_i).$

The Kullback-Leibler divergence, denoted as $$D_{KL}(P||Q)$$, is given by

$D_{KL}(P || Q) = \sum_{i=1}^n p_i \log \left(\frac{p_i}{q_i}\right).$

The Jeffrey’s divergence, or Information Value (IV), is a symmetric measure expressible in terms of the Kullback-Leibler divergence, defined by

\begin{split}\begin{align*} J(P|| Q) &= D_{KL}(P || Q) + D_{KL}(Q || P) = \sum_{i=1}^n p_i \log \left(\frac{p_i}{q_i}\right) + \sum_{i=1}^n q_i \log \left(\frac{q_i}{p_i}\right)\\ &= \sum_{i=1}^n (p_i - q_i) \log \left(\frac{p_i}{q_i}\right). \end{align*}\end{split}

The Jensen-Shannon divergence is a bounded symmetric measure also expressible in terms of the Kullback-Leibler divergence

$JSD(P || Q) = \frac{1}{2}\left(D_{KL}(P || M) + D_{KL}(Q || M)\right), \quad M = \frac{1}{2}(P + Q),$

and bounded by $$JSD(P||Q) \in [0, \log(2)]$$. We note that these measures cannot be directly used whenever $$p_i = 0$$ and/or $$q_i = 0$$.
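
These definitions translate directly to numpy. The sketch below (not the library implementation) leaves the handling of zero probabilities to the caller, consistent with the caveat above:

```python
import numpy as np

def kl(p, q):
    # D_KL(P || Q) = sum_i p_i log(p_i / q_i)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

def jeffrey_div(p, q):
    # J(P || Q) = sum_i (p_i - q_i) log(p_i / q_i)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum((p - q) * np.log(p / q))

def jsd(p, q):
    # JSD(P || Q) = (D_KL(P || M) + D_KL(Q || M)) / 2,  M = (P + Q) / 2
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * (kl(p, m) + kl(q, m))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
```

The identities above can be verified numerically: Jeffrey’s divergence equals the sum of the two Kullback-Leibler directions, and the Jensen-Shannon divergence is symmetric and bounded by $$\log(2)$$.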

optbinning.binning.metrics.entropy(x)

Calculate the entropy of a discrete probability distribution.

Parameters

x (array-like) – Discrete probability distribution.

Returns

entropy

Return type

float

optbinning.binning.metrics.kullback_leibler(x, y, return_sum=False)

Calculate the Kullback-Leibler divergence between two distributions.

Parameters
• x (array-like) – Discrete probability distribution.

• y (array-like) – Discrete probability distribution.

• return_sum (bool) – Return the sum of the Kullback-Leibler values.

Returns

kullback_leibler

Return type

float or numpy.ndarray

optbinning.binning.metrics.jeffrey(x, y, return_sum=False)

Calculate the Jeffrey’s divergence between two distributions.

Parameters
• x (array-like) – Discrete probability distribution.

• y (array-like) – Discrete probability distribution.

• return_sum (bool) – Return the sum of the Jeffrey’s divergence values.

Returns

jeffrey

Return type

float or numpy.ndarray

optbinning.binning.metrics.jensen_shannon(x, y, return_sum=False)

Calculate the Jensen-Shannon divergence between two distributions.

Parameters
• x (array-like) – Discrete probability distribution.

• y (array-like) – Discrete probability distribution.

• return_sum (bool) – Return the sum of the Jensen-Shannon values.

Returns

jensen_shannon

Return type

float or numpy.ndarray

optbinning.binning.metrics.jensen_shannon_multivariate(X, weights=None)

Calculate Jensen-Shannon divergence between several distributions.

Parameters
• X (array-like, shape = (n_samples, n_distributions)) – Discrete probability distributions.

• weights (array-like, shape = (n_distributions)) – Array of weights associated with the distributions. If None all distributions are assumed to have equal weight.

Returns

jensen_shannon

Return type

float
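
The standard weighted generalization of the Jensen-Shannon divergence to several distributions is $$JSD_w(P_1, \ldots, P_k) = S\left(\sum_i w_i P_i\right) - \sum_i w_i S(P_i)$$, which reduces to the pairwise formula for two equally weighted distributions. A numpy sketch under that assumption (the library's exact implementation may differ):

```python
import numpy as np

def entropy(p):
    # S(P) = -sum_i p_i log(p_i)
    p = np.asarray(p, float)
    return -np.sum(p * np.log(p))

def jsd_multivariate(X, weights=None):
    """Generalized Jensen-Shannon divergence.
    X has shape (n_samples, n_distributions), one distribution per column."""
    X = np.asarray(X, float)
    n_dist = X.shape[1]
    if weights is None:
        weights = np.full(n_dist, 1.0 / n_dist)
    weights = np.asarray(weights, float)
    mixture = X @ weights  # weighted mixture distribution
    col_entropies = np.array([entropy(X[:, j]) for j in range(n_dist)])
    return entropy(mixture) - np.sum(weights * col_entropies)
```

With two columns and equal weights, the result coincides with the pairwise Jensen-Shannon divergence defined earlier, and it vanishes when all distributions are identical.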

optbinning.binning.metrics.hellinger(x, y, return_sum=False)

Calculate the Hellinger discrimination between two distributions.

Parameters
• x (array-like) – Discrete probability distribution.

• y (array-like) – Discrete probability distribution.

• return_sum (bool) – Return the sum of the Hellinger values.

Returns

hellinger

Return type

float or numpy.ndarray

optbinning.binning.metrics.triangular(x, y, return_sum=False)

Calculate the LeCam or triangular discrimination between two distributions.

Parameters
• x (array-like) – Discrete probability distribution.

• y (array-like) – Discrete probability distribution.

• return_sum (bool) – Return the sum of the triangular discrimination values.

Returns

triangular

Return type

float or numpy.ndarray