Utilities

Pre-binning

class optbinning.binning.prebinning.PreBinning(problem_type, method, n_bins, min_bin_size, class_weight=None, **kwargs)

Bases: object

Prebinning algorithms.

Parameters
  • problem_type (str) – The problem type, determined by the target type (e.g. "classification" for a binary target or "regression" for a continuous target).

  • method (str) – Available methods are ‘uniform’, ‘quantile’, ‘cart’ and ‘mdlp’.

  • n_bins (int) – The number of bins to produce.

  • min_bin_size (int, float) – The minimum bin size.

  • **kwargs (keyword arguments) – Keyword arguments for prebinning method. See notes.

Notes

Keyword arguments are those available in the following classes:

  • method="uniform": sklearn.preprocessing.KBinsDiscretizer.

  • method="quantile": sklearn.preprocessing.KBinsDiscretizer.

  • method="cart": sklearn.tree.DecisionTreeClassifier.

  • method="mdlp": optbinning.binning.mdlp.MDLP.

fit(x, y, sample_weight=None)

Fit PreBinning algorithm.

Parameters
  • x (array-like, shape = (n_samples)) – Data samples, where n_samples is the number of samples.

  • y (array-like, shape = (n_samples)) – Target vector relative to x.

  • sample_weight (array-like of shape (n_samples,) (default=None)) – Array of weights that are assigned to individual samples.

Returns

self

Return type

PreBinning

property splits

List of split points

Returns

splits

Return type

numpy.ndarray
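
The prebinner can be used on its own to obtain candidate split points for a single feature. A minimal sketch (the dataset, feature and parameter values are illustrative, and problem_type="classification" is assumed for a binary target):

import numpy as np
from sklearn.datasets import load_breast_cancer
from optbinning.binning.prebinning import PreBinning

data = load_breast_cancer()
x = data.data[:, 0]  # feature "mean radius"
y = data.target

# CART-based prebinning: at most 10 bins, minimum bin size of 5% of the samples.
prebinning = PreBinning(problem_type="classification", method="cart",
                        n_bins=10, min_bin_size=0.05)
prebinning.fit(x, y)

print(prebinning.splits)  # numpy.ndarray of candidate split points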

Transformations

The Weight of Evidence \(\text{WoE}_i\) and event rate \(D_i\) for each bin are related by means of the functional equations

\[\begin{split}\begin{align} \text{WoE}_i &= \log\left(\frac{1 - D_i}{D_i}\right) + \log\left(\frac{N_T^{E}}{N_T^{NE}}\right) = \log\left(\frac{N_T^{E}}{N_T^{NE}}\right) - \text{logit}(D_i)\\ D_i &= \left(1 + \frac{N_T^{NE}}{N_T^{E}} e^{\text{WoE}_i}\right)^{-1} = \left(1 + e^{\text{WoE}_i - \log\left(\frac{N_T^{E}}{N_T^{NE}}\right)}\right)^{-1}, \end{align}\end{split}\]

where \(D_i\) can be characterized as a logistic function of \(\text{WoE}_i\), and \(\text{WoE}_i\) can be expressed in terms of the logit function of \(D_i\). The constant term \(\log(N_T^{E} / N_T^{NE})\) is the log ratio of the total number of events \(N_T^{E}\) to the total number of non-events \(N_T^{NE}\). This shows that WoE is inversely related to the event rate.
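
For example, with \(N_T^{E} = 200\) events, \(N_T^{NE} = 800\) non-events and an event rate \(D_i = 0.5\), the first equation gives \(\text{WoE}_i = \log(1) + \log(200/800) \approx -1.386\), and substituting this value into the second equation recovers \(D_i = \left(1 + 4 e^{-1.386}\right)^{-1} = 0.5\).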

optbinning.binning.transformations.transform_event_rate_to_woe(event_rate, n_nonevent, n_event)

Transform event rate to WoE.

Parameters
  • event_rate (array-like or float) – Event rate.

  • n_nonevent (int) – Total number of non-events.

  • n_event (int) – Total number of events.

Returns

woe – Weight of evidence.

Return type

numpy.ndarray or float

optbinning.binning.transformations.transform_woe_to_event_rate(woe, n_nonevent, n_event)

Transform WoE to event rate.

Parameters
  • woe (array-like or float) – Weight of evidence.

  • n_nonevent (int) – Total number of non-events.

  • n_event (int) – Total number of events.

Returns

event_rate – Event rate.

Return type

numpy.ndarray or float
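
A minimal sketch showing that the two transformations are inverses of each other (the totals and event rates are illustrative):

import numpy as np
from optbinning.binning.transformations import (
    transform_event_rate_to_woe, transform_woe_to_event_rate)

n_event, n_nonevent = 200, 800
event_rate = np.array([0.05, 0.20, 0.50])

woe = transform_event_rate_to_woe(event_rate, n_nonevent, n_event)
print(woe)  # decreases as the event rate increases

recovered = transform_woe_to_event_rate(woe, n_nonevent, n_event)
print(recovered)  # [0.05 0.2  0.5 ]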

Metrics

Gini coefficient

The Gini coefficient, or Accuracy Ratio, is a quantitative measure of discriminatory and predictive power given a distribution of events and non-events. The Gini coefficient ranges from 0 to 1 and is defined by

\[Gini = 1 - \frac{2 \sum_{i=2}^n \left(N_i^{E} \sum_{j=1}^{i-1} N_j^{NE}\right) + \sum_{k=1}^n N_k^{E} N_k^{NE}}{N_T^{E} N_T^{NE}},\]

where \(N_i^{E}\) and \(N_i^{NE}\) are the number of events and non-events per bin, respectively, and \(N_T^{E}\) and \(N_T^{NE}\) are the total number of events and non-events, respectively.

optbinning.binning.metrics.gini(event, nonevent)

Calculate the Gini coefficient given the number of events and non-events.

Parameters
  • event (array-like) – Number of events.

  • nonevent (array-like) – Number of non-events.

Returns

gini

Return type

float
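
A minimal sketch of computing the Gini coefficient from per-bin counts (the counts are illustrative, with bins ordered by event rate as in a binning table):

import numpy as np
from optbinning.binning.metrics import gini

event = np.array([10, 40, 90, 160])
nonevent = np.array([290, 260, 210, 140])

print(gini(event, nonevent))  # a float in [0, 1]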

Divergence measures

Given two discrete probability distributions \(P\) and \(Q\), the Shannon entropy of \(P\) is defined as

\[S(P) = - \sum_{i=1}^n p_i \log(p_i).\]

The Kullback-Leibler divergence, denoted as \(D_{KL}(P||Q)\), is given by

\[D_{KL}(P || Q) = \sum_{i=1}^n p_i \log \left(\frac{p_i}{q_i}\right).\]

Jeffrey’s divergence, or Information Value (IV), is a symmetric measure expressible in terms of the Kullback-Leibler divergence and defined by

\[\begin{split}\begin{align*} J(P|| Q) &= D_{KL}(P || Q) + D_{KL}(Q || P) = \sum_{i=1}^n p_i \log \left(\frac{p_i}{q_i}\right) + \sum_{i=1}^n q_i \log \left(\frac{q_i}{p_i}\right)\\ &= \sum_{i=1}^n (p_i - q_i) \log \left(\frac{p_i}{q_i}\right). \end{align*}\end{split}\]

The Jensen-Shannon divergence is a bounded symmetric measure also expressible in terms of the Kullback-Leibler divergence

\[\begin{equation} JSD(P || Q) = \frac{1}{2}\left(D(P || M) + D(Q || M)\right), \quad M = \frac{1}{2}(P + Q), \end{equation}\]

and bounded by \(JSD(P||Q) \in [0, \log(2)]\). We note that these measures cannot be directly used whenever \(p_i = 0\) and/or \(q_i = 0\).
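
A short numpy sketch (with illustrative, zero-free distributions) verifying the identity \(J(P||Q) = D_{KL}(P||Q) + D_{KL}(Q||P)\) and the bound on the Jensen-Shannon divergence:

import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

# Jeffrey's divergence as a single sum over the bins.
j = np.sum((p - q) * np.log(p / q))
assert np.isclose(j, kl_pq + kl_qp)

# Jensen-Shannon divergence against the mixture M = (P + Q) / 2.
m = 0.5 * (p + q)
jsd = 0.5 * (np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))
assert 0.0 <= jsd <= np.log(2)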

optbinning.binning.metrics.entropy(x)

Calculate the entropy of a discrete probability distribution.

Parameters

x (array-like) – Discrete probability distribution.

Returns

entropy

Return type

float

optbinning.binning.metrics.kullback_leibler(x, y, return_sum=False)

Calculate the Kullback-Leibler divergence between two distributions.

Parameters
  • x (array-like) – Discrete probability distribution.

  • y (array-like) – Discrete probability distribution.

  • return_sum (bool) – Return the sum of the Kullback-Leibler divergence values.

Returns

kullback_leibler

Return type

float or numpy.ndarray

optbinning.binning.metrics.jeffrey(x, y, return_sum=False)

Calculate Jeffrey’s divergence between two distributions.

Parameters
  • x (array-like) – Discrete probability distribution.

  • y (array-like) – Discrete probability distribution.

  • return_sum (bool) – Return the sum of the Jeffrey’s divergence values.

Returns

jeffrey

Return type

float or numpy.ndarray

optbinning.binning.metrics.jensen_shannon(x, y, return_sum=False)

Calculate the Jensen-Shannon divergence between two distributions.

Parameters
  • x (array-like) – Discrete probability distribution.

  • y (array-like) – Discrete probability distribution.

  • return_sum (bool) – Return the sum of the Jensen-Shannon divergence values.

Returns

jensen_shannon

Return type

float or numpy.ndarray
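
A minimal usage sketch of the divergence functions above (the distributions are illustrative and contain no zero entries). With return_sum=True a scalar is returned; with return_sum=False the per-bin contributions are returned, so their sum equals the scalar:

import numpy as np
from optbinning.binning.metrics import (entropy, kullback_leibler, jeffrey,
                                        jensen_shannon)

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

print(entropy(p))                               # Shannon entropy of P
print(kullback_leibler(p, q, return_sum=True))  # D_KL(P||Q)
print(jeffrey(p, q, return_sum=True))           # Information Value
print(jensen_shannon(p, q, return_sum=True))    # in [0, log(2)]

contributions = jeffrey(p, q, return_sum=False)
print(contributions.sum())                      # equals the scalar above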

optbinning.binning.metrics.jensen_shannon_multivariate(X, weights=None)

Calculate Jensen-Shannon divergence between several distributions.

Parameters
  • X (array-like, shape = (n_samples, n_distributions)) – Discrete probability distributions.

  • weights (array-like, shape = (n_distributions)) – Array of weights associated with the distributions. If None all distributions are assumed to have equal weight.

Returns

jensen_shannon

Return type

float
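
A minimal sketch for several distributions at once (the distributions are illustrative; each column of X is a probability distribution over the bins):

import numpy as np
from optbinning.binning.metrics import jensen_shannon_multivariate

X = np.array([[0.10, 0.25, 0.20],
              [0.20, 0.25, 0.30],
              [0.30, 0.25, 0.30],
              [0.40, 0.25, 0.20]])

print(jensen_shannon_multivariate(X))  # equal weights by default

# Explicit weights (assumed here to sum to one).
print(jensen_shannon_multivariate(X, weights=np.array([0.5, 0.25, 0.25])))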

optbinning.binning.metrics.hellinger(x, y, return_sum=False)

Calculate the Hellinger discrimination between two distributions.

Parameters
  • x (array-like) – Discrete probability distribution.

  • y (array-like) – Discrete probability distribution.

  • return_sum (bool) – Return the sum of the Hellinger discrimination values.

Returns

hellinger

Return type

float or numpy.ndarray

optbinning.binning.metrics.triangular(x, y, return_sum=False)

Calculate the LeCam or triangular discrimination between two distributions.

Parameters
  • x (array-like) – Discrete probability distribution.

  • y (array-like) – Discrete probability distribution.

  • return_sum (bool) – Return the sum of the triangular discrimination values.

Returns

triangular

Return type

float or numpy.ndarray
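
A minimal usage sketch for the Hellinger and triangular discriminations (the distributions are illustrative):

import numpy as np
from optbinning.binning.metrics import hellinger, triangular

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

print(hellinger(p, q, return_sum=True))   # Hellinger discrimination
print(triangular(p, q, return_sum=True))  # LeCam / triangular discrimination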