MDLP discretization algorithm

class optbinning.MDLP(min_samples_split=2, min_samples_leaf=2, max_candidates=32)

Bases: sklearn.base.BaseEstimator

Minimum Description Length Principle (MDLP) discretization algorithm.

Parameters
  • min_samples_split (int (default=2)) – The minimum number of samples required to split an internal node.

  • min_samples_leaf (int (default=2)) – The minimum number of samples required to be at a leaf node.

  • max_candidates (int (default=32)) – The maximum number of split points to evaluate at each partition.

Notes

Implementation of the discretization algorithm in [FI93]. To increase efficiency, a dynamic split strategy based on binning the candidate splits [CMR2001] is used. For large datasets, a smaller max_candidates (e.g. 16) is recommended to obtain a significant speedup.
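For orientation, the acceptance test at the heart of [FI93] can be sketched in plain NumPy. This is a minimal illustration of the published criterion, not the library's code: the helper names entropy and mdlp_accepts are hypothetical, and the dynamic candidate binning of [CMR2001] is left out.

```python
import numpy as np


def entropy(y):
    """Shannon entropy, in bits, of the class labels in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def mdlp_accepts(y_left, y_right):
    """Return True if the MDLP criterion of [FI93] accepts the split.

    Hypothetical helper for illustration; requires at least two samples.
    """
    y = np.concatenate([y_left, y_right])
    n, n1, n2 = len(y), len(y_left), len(y_right)
    ent, ent1, ent2 = entropy(y), entropy(y_left), entropy(y_right)
    # Number of distinct classes in the parent and in each side.
    k, k1, k2 = (len(np.unique(a)) for a in (y, y_left, y_right))
    # Information gain of the split.
    gain = ent - (n1 / n) * ent1 - (n2 / n) * ent2
    # Cost term of the MDLP criterion.
    delta = np.log2(3.0 ** k - 2.0) - (k * ent - k1 * ent1 - k2 * ent2)
    return bool(gain > (np.log2(n - 1) + delta) / n)


# A split that separates the classes perfectly is accepted.
pure = mdlp_accepts(np.zeros(50, dtype=int), np.ones(50, dtype=int))

# A split that leaves both sides as mixed as the parent is rejected.
mixed = np.tile([0, 1], 25)
uninformative = mdlp_accepts(mixed, mixed)
```

A recursive discretizer would apply this test to the best candidate split of each partition and stop splitting a partition once the test fails.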

References

FI93

U. M. Fayyad and K. B. Irani. “Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning”. International Joint Conferences on Artificial Intelligence, 13:1022–1027, 1993.

CMR2001

D. M. Chickering, C. Meek and R. Rounthwaite. “Efficient Determination of Dynamic Split Points in a Decision Tree”. In Proceedings of the 2001 IEEE International Conference on Data Mining, 91–98, 2001.

fit(x, y)

Fit MDLP discretization algorithm.

Parameters
  • x (array-like, shape = (n_samples,)) – Data samples, where n_samples is the number of samples.

  • y (array-like, shape = (n_samples,)) – Target vector relative to x.

Returns

self

Return type

MDLP

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance
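The <component>__<parameter> form can be illustrated with a small scikit-learn Pipeline. The pipeline and its step names ("scale", "clf") are chosen purely for illustration and are not part of optbinning. On a simple estimator such as MDLP, the same methods take plain parameter names such as max_candidates.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative nested object with two named components.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Update one parameter on each component through the nested naming scheme.
returned = pipe.set_params(clf__C=0.5, scale__with_mean=False)

# With deep=True, the parameters of the contained estimators are included.
params = pipe.get_params(deep=True)
print(params["clf__C"], params["scale__with_mean"])
```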

property splits

Split points found by the discretization.

Returns

splits

Return type

numpy.ndarray
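Once the split points are available, one way to map values to bin indices is numpy.digitize. The splits array below is a hypothetical stand-in for the property's value, and treating each bin as closed on the left is an assumption of this sketch rather than documented behaviour.

```python
import numpy as np

# Hypothetical split points, standing in for the value of the splits property.
splits = np.array([1.5, 3.0])

x_new = np.array([0.2, 1.5, 2.9, 3.0, 7.1])

# Bin i covers splits[i - 1] <= x < splits[i]; bin 0 is everything below
# the first split and the last bin is everything from the last split upward.
bin_index = np.digitize(x_new, splits, right=False)
print(bin_index)  # [0 1 1 2 2]
```

With k split points this yields k + 1 bins, indexed from 0 to k.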