Transforming categorical features to numerical features

CatBoost supports the following types of features:
  • Numerical. Examples are the height (“182”, “173”), or any binary feature (“0”, “1”).

  • Categorical (cat). Such features can take one of a limited number of possible values. These values are usually fixed. Examples are the musical genre (“rock”, “indie”, “pop”) and the musical style (“dance”, “classical”).

Before each split is selected in the tree (see Choosing the tree structure), categorical features are transformed to numerical. This is done using various statistics on combinations of categorical features and combinations of categorical and numerical features.

The method of transforming categorical features to numerical generally includes the following stages:
  1. Permuting the set of input objects in a random order.

  2. Converting the label value from a floating point to an integer.

    The method depends on the machine learning problem being solved (which is determined by the selected loss function).
    • Regression. Binarization is performed on the label value. The mode and the number of buckets are set in the starting parameters. All values located inside a single bucket are assigned the same label value class: an integer defined by the formula <bucket ID – 1>.

    • Classification. Possible values for the label value are “0” (the object does not belong to the specified target class) and “1” (the object belongs to the specified target class).

    • Multiclassification. The label values are integer identifiers of target classes (starting from “0”).
  3. Transforming categorical features to numerical features.

    The method is determined by the starting parameters.

    • Borders

      Calculating ctr for the i-th bucket (i ∈ [0; k - 1], where k is the number of borders):

      ctr = (countInClass + prior) / (totalCount + 1)

      • countInClass is how many times the label value exceeded i for objects with the current categorical feature value. It only counts objects that already have this value calculated (calculations are made in the order of the objects after shuffling).
      • totalCount is the total number of objects (up to the current one) that have a feature value matching the current one.
      • prior is a number (constant) defined by the starting parameters.
    • Buckets

      Calculating ctr for the i-th bucket (i ∈ [0; k]; splitting on k borders creates k + 1 buckets):

      ctr = (countInClass + prior) / (totalCount + 1)

      • countInClass is how many times the label value was equal to i for objects with the current categorical feature value. It only counts objects that already have this value calculated (calculations are made in the order of the objects after shuffling).
      • totalCount is the total number of objects (up to the current one) that have a feature value matching the current one.
      • prior is a number (constant) defined by the starting parameters.
    • BinarizedTargetMeanValue

      How ctr is calculated:

      ctr = (countInClass + prior) / (totalCount + 1)

      • countInClass is the ratio of the sum of the label value integers for this categorical feature value to the maximum label value integer.
      • totalCount is the total number of objects that have a feature value matching the current one.
      • prior is a number (constant) defined by the starting parameters.
    • Counter

      How ctr is calculated for the training dataset:

      ctr = (curCount + prior) / (maxCount + 1)

      • curCount is the total number of objects in the training dataset with the current categorical feature value.
      • maxCount is the number of objects in the training dataset with the most frequent feature value.
      • prior is a number (constant) defined by the starting parameters.

      How ctr is calculated for the test dataset (the same formula, with the counts depending on the chosen calculation method):

      • curCount is computed as follows:
        • Full — the sum of the total number of objects in the training dataset with the current categorical feature value and the number of objects in the test dataset with the current categorical feature value.
        • SkipTest — the total number of objects in the training dataset with the current categorical feature value.
      • maxCount is the number of objects with the most frequent feature value in one of the following sets of objects, depending on the chosen calculation method:
        • Full — the training and the test datasets.
        • SkipTest — the training dataset.
      • prior is a number (constant) defined by the starting parameters.

      Note. This ctr does not depend on the label value.

As a result, each categorical feature value or feature combination value is assigned a numerical feature.
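The Counter-type statistic above can be sketched in a few lines. This is a simplified illustration, not CatBoost's implementation, and it assumes the training-dataset formula ctr = (curCount + prior) / (maxCount + 1):

```python
from collections import Counter

def counter_ctr(values, prior=0.05):
    """Counter-type ctr for a training dataset: it depends only on how
    often each categorical value occurs, never on the label values."""
    counts = Counter(values)
    max_count = max(counts.values())
    # Each distinct categorical value maps to one numerical value.
    return {v: (c + prior) / (max_count + 1) for v, c in counts.items()}

genres = ["rock", "rock", "indie", "pop", "rock"]
ctr = counter_ctr(genres)
# "rock" occurs 3 times and is the most frequent value,
# so its ctr is (3 + 0.05) / (3 + 1) = 0.7625
print(ctr)
```

Because the most frequent value drives the denominator, the ctr of every value is at most slightly above 1, and rare values get small ctrs regardless of their labels.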

Example of aggregating multiple features

Assume that the objects in the training set have two categorical features: the musical genre (“rock”, “indie”) and the musical style (“dance”, “classical”). These features can occur in different combinations. CatBoost can create a new feature that is a combination of those listed (“dance rock”, “classical rock”, “dance indie”, or “classical indie”). Any number of features can be combined.
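Such a combined feature can be thought of as simple concatenation of the original values. A minimal sketch (the function name and the “style genre” ordering are illustrative, not CatBoost internals):

```python
def combine_features(genres, styles):
    """Build a new categorical feature whose values are pairs of the
    original values, e.g. ("rock", "dance") -> "dance rock"."""
    return [f"{style} {genre}" for genre, style in zip(genres, styles)]

print(combine_features(["rock", "indie"], ["dance", "classical"]))
# ['dance rock', 'classical indie']
```

The combined column is then just another categorical feature, so the same ctr statistics can be computed for it.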

Transforming categorical features to numerical features in classification

  1. CatBoost accepts a set of object properties and the corresponding function (label) values as input.

    The table below shows what the results of this stage look like.

    Object #   …     Genre   Function value
    1          240   rock    1
    2          355   indie   0
    3          534   pop     1
    4          245   rock    0
    5          453   rock    0
    6          248   indie   1
    7          542   rock    1
    …
  2. The rows in the input file are randomly shuffled several times. Multiple random permutations are generated.

    The table below shows what the results of this stage look like.

    Object #   …     Genre   Function value
    1          453   rock    0
    2          355   indie   0
    3          240   rock    1
    4          542   rock    1
    5          534   pop     1
    6          248   indie   1
    7          245   rock    0
    …
  3. All categorical feature values are transformed to numerical values using the following formula:

    ctr = (countInClass + prior) / (totalCount + 1)

    • countInClass is how many times the label value was equal to “1” for objects with the current categorical feature value.
    • prior is the preliminary value for the numerator. It is determined by the starting parameters.
    • totalCount is the total number of objects (up to the current one) that have a categorical feature value matching the current one.
    Note. These values are calculated individually for each object using data from the previous objects only.

    In the example with musical genres, the categorical feature accepts the values “rock”, “pop”, and “indie”, and prior is set to 0.05.

    The table below shows what the results of this stage look like.

    Object #   …     Genre (ctr)   Function value
    1          453   0.05          0
    2          355   0.05          0
    3          240   0.025         1
    4          542   0.35          1
    5          534   0.05          1
    6          248   0.025         1
    7          245   0.5125        0
    …
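The per-object calculation in this example can be sketched as follows. This is a simplified single-permutation version, not the actual CatBoost code; it assumes the formula ctr = (countInClass + prior) / (totalCount + 1) and reproduces the ctr column for the shuffled genre data with prior = 0.05:

```python
from collections import defaultdict

def ordered_ctr(values, labels, prior=0.05):
    """For each object, compute (countInClass + prior) / (totalCount + 1)
    using only the objects that came before it in the permutation."""
    stats = defaultdict(lambda: [0, 0])  # value -> [countInClass, totalCount]
    result = []
    for value, label in zip(values, labels):
        count_in_class, total_count = stats[value]
        result.append((count_in_class + prior) / (total_count + 1))
        # Update the statistics only after the ctr for this object is
        # computed, so no object "sees" its own label.
        stats[value][0] += label  # label is 0 or 1
        stats[value][1] += 1
    return result

shuffled_genres = ["rock", "indie", "rock", "rock", "pop", "indie", "rock"]
shuffled_labels = [0, 0, 1, 1, 1, 1, 0]
print(ordered_ctr(shuffled_genres, shuffled_labels))
```

For instance, the fourth object is preceded by two “rock” objects with one positive label, giving (1 + 0.05) / (2 + 1) = 0.35.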
One-hot encoding is also supported. Use one of the following training parameters to enable it.
• CLI parameter: --one-hot-max-size
• Python parameter: one_hot_max_size
• R parameter: one_hot_max_size

Description: Use one-hot encoding for all features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.
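A sketch of what this threshold means (illustrative only; CatBoost applies the rule internally, and the function name here is hypothetical):

```python
def maybe_one_hot(values, one_hot_max_size):
    """One-hot encode a categorical column if its cardinality is small
    enough; otherwise return None to signal that ctrs should be used."""
    categories = sorted(set(values))
    if len(categories) > one_hot_max_size:
        return None  # too many distinct values: fall back to ctr statistics
    index = {category: i for i, category in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

print(maybe_one_hot(["rock", "indie", "rock"], one_hot_max_size=2))
# [[0, 1], [1, 0], [0, 1]]  (columns: indie, rock)
print(maybe_one_hot(["rock", "indie", "pop"], one_hot_max_size=2))
# None
```

Low-cardinality features lose nothing from one-hot encoding, while high-cardinality features would explode into many sparse columns, which is exactly the case the ctr statistics are designed for.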