Binarization

Before training, the possible values of each feature are divided into disjoint ranges (buckets) delimited by threshold values (splits). The size of the binarization (the number of splits) is set by the starting parameters, separately for numerical features and for the numbers obtained by converting categorical features into numerical features.
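To make the relation between splits and buckets concrete, here is a minimal sketch in plain Python. The function name `bucket_index` and the rule that a value equal to a split falls into the lower bucket are assumptions made for illustration, not part of the CatBoost API.

```python
from bisect import bisect_left

def bucket_index(value, splits):
    """Return the index of the bucket that `value` falls into.

    `splits` must be sorted in ascending order; n splits delimit n + 1 buckets.
    A value equal to a split is placed in the lower bucket (assumed convention).
    """
    return bisect_left(splits, value)

splits = [0.5, 2.0, 7.5]               # 3 splits -> 4 buckets
values = [0.1, 0.5, 1.9, 2.1, 10.0]
buckets = [bucket_index(v, splits) for v in values]
print(buckets)  # [0, 0, 1, 2, 3]
```

After this step, each feature value is represented only by its bucket index, so the number of splits bounds the number of distinct values that remain.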

Binarization is also used to split the label values when working with categorical features. On large datasets, a random subset of the dataset is used for this purpose.

The table below shows the binarization modes provided in CatBoost.

Mode: How splits are chosen
Median

Include an approximately equal number of objects in every bucket.

Uniform

Form buckets of equal width, with the splits evenly spaced between the minimum and maximum values.

UniformAndQuantiles
Combine the splits obtained in the following modes, after first halving the binarization size provided by the starting parameters for each of them:
  • Median.
  • Uniform.
MaxLogSum

Maximize the value of the following expression inside each bucket:

$$\sum_{i=1}^{n} \log(weight_i)$$

  • n — The number of distinct objects in the bucket.
  • weight_i — The number of times the i-th distinct object in the bucket is repeated.
MinEntropy

Minimize the value of the following expression inside each bucket:

$$\sum_{i=1}^{n} weight_i \cdot \log(weight_i)$$

  • n — The number of distinct objects in the bucket.
  • weight_i — The number of times the i-th distinct object in the bucket is repeated.
GreedyLogSum

Maximize the greedy approximation of the following expression inside every bucket:

$$\sum_{i=1}^{n} \log(weight_i)$$

  • n — The number of distinct objects in the bucket.
  • weight_i — The number of times the i-th distinct object in the bucket is repeated.
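
The Median, Uniform, and UniformAndQuantiles modes from the table can be approximated in a few lines of plain Python. This is an illustrative sketch only: the function names are hypothetical, splits are placed halfway between neighbouring sorted values by assumption, and CatBoost's actual implementation may differ in tie handling and edge cases.

```python
def uniform_splits(values, n_splits):
    """Equal-width buckets: splits evenly spaced between min and max."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_splits + 1)
    return [lo + step * k for k in range(1, n_splits + 1)]

def median_splits(values, n_splits):
    """Roughly equal-count buckets: splits taken at evenly spaced ranks."""
    ordered = sorted(values)
    n = len(ordered)
    splits = []
    for k in range(1, n_splits + 1):
        idx = k * n // (n_splits + 1)
        if idx == 0:
            continue  # too few values to place this split
        # Assumed convention: split halfway between two neighbouring values.
        splits.append((ordered[idx - 1] + ordered[idx]) / 2)
    return sorted(set(splits))  # drop duplicate splits

def uniform_and_quantiles_splits(values, n_splits):
    """Halve the requested size, run both modes, and merge their splits."""
    half = n_splits // 2
    merged = set(median_splits(values, half)) | set(uniform_splits(values, half))
    return sorted(merged)

values = [0, 1, 2, 3, 4, 5, 6, 20]
print(uniform_splits(values, 3))  # [5.0, 10.0, 15.0]
print(median_splits(values, 3))   # [1.5, 3.5, 5.5]
print(uniform_and_quantiles_splits(values, 4))
```

With the outlier 20 in the data, the Uniform splits leave most values in the first two buckets, while the Median splits put two values in each bucket. UniformAndQuantiles combines both sets of splits.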