FeaturesData

class FeaturesData(num_feature_data=None,
                   cat_feature_data=None,
                   num_feature_names=None,
                   cat_feature_names=None)

Purpose

Stores feature data in an optimized form for passing to the Pool constructor. Creating a pool from this representation is much faster than from a generic numpy.ndarray, pandas.DataFrame, or pandas.Series when the dataset contains both numerical and categorical features and most of them are numerical. For datasets that contain only numerical features, pass a numpy.ndarray with the numpy.float32 dtype to get similar performance.

Parameters

num_feature_data

Possible types: numpy.ndarray

Description: Numerical features for all objects in the dataset, as a numpy.ndarray of shape (object_count x num_feature_count) with dtype numpy.float32.

Default value: None (the dataset does not contain numerical features)

cat_feature_data

Possible types: numpy.ndarray

Description: Categorical features for all objects in the dataset, as a numpy.ndarray of shape (object_count x cat_feature_count) with dtype object.

The elements must be of bytes type and should contain UTF-8 encoded strings.

Attention. Categorical features must be passed as byte strings, for example:

data=FeaturesData(cat_feature_data=np.array([[b'a', b'c'], [b'b', b'c']], dtype=object))

Using other data types (for example, int32) raises an error.

Default value: None (the dataset does not contain categorical features)

num_feature_names

Possible types:
  • list of strings
  • list of bytes

Description: The names of numerical features, as a sequence of strings or bytes.

If a name is represented by the bytes type, it must be UTF-8 encoded.

Default value: None (the num_feature_names data attribute is set to a list of empty strings)

cat_feature_names

Possible types:
  • list of strings
  • list of bytes

Description: The names of categorical features, as a sequence of strings or bytes.

If a name is represented by the bytes type, it must be UTF-8 encoded.

Default value: None (the cat_feature_names data attribute is set to a list of empty strings)
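
The dtype requirements above can be met with a small amount of NumPy preprocessing. The following is a minimal sketch, not part of the catboost API: the helper names to_float32_matrix and to_bytes_matrix are hypothetical, and the sketch assumes that non-bytes categorical values should be converted with str() and encoded as UTF-8.

```python
import numpy as np


def to_float32_matrix(rows):
    """Hypothetical helper: build a float32 array for num_feature_data."""
    return np.asarray(rows, dtype=np.float32)


def to_bytes_matrix(rows):
    """Hypothetical helper: build an object array of UTF-8 bytes for cat_feature_data."""
    encoded = [
        # Keep bytes as-is; convert anything else to str and encode as UTF-8.
        [v if isinstance(v, bytes) else str(v).encode("utf-8") for v in row]
        for row in rows
    ]
    return np.array(encoded, dtype=object)


num = to_float32_matrix([[1, 4.5], [2, 6.0]])
cat = to_bytes_matrix([["red", 7], ["blue", b"x"]])
```

The resulting arrays have the shapes and dtypes described in the parameter list and could then be passed as FeaturesData(num_feature_data=num, cat_feature_data=cat).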

Specifics

  • The order of features in the created Pool is the following:
    [num_features (if any present)][cat_features (if any present)]
  • The feature data must be passed in the same order when applying the trained model.
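
The ordering rule above implies that, in the created Pool, numerical features occupy the first positions and categorical features follow them. The sketch below only illustrates that arithmetic. The function pool_feature_layout and the feature names are hypothetical and are not part of catboost.

```python
def pool_feature_layout(num_feature_names, cat_feature_names):
    """Hypothetical helper: mirror the [num_features][cat_features] order."""
    names = list(num_feature_names) + list(cat_feature_names)
    # Categorical features start right after the last numerical feature.
    first_cat = len(num_feature_names)
    cat_indices = list(range(first_cat, len(names)))
    return names, cat_indices


names, cat_indices = pool_feature_layout(["age", "income"], ["city"])
print(names)        # ['age', 'income', 'city']
print(cat_indices)  # [2]
```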

Methods

  • get_object_count: Return the number of objects contained in the dataset.
  • get_num_feature_count: Return the number of numerical features contained in the dataset.
  • get_cat_feature_count: Return the number of categorical features contained in the dataset.
  • get_feature_count: Return the total number of features (both numerical and categorical) contained in the dataset.
  • get_feature_names: Return the names of features from the dataset.

Usage examples

CatBoostClassifier with FeaturesData

import numpy as np
from catboost import CatBoostClassifier, FeaturesData
# Initialize data
train_data = FeaturesData(
    num_feature_data=np.array([[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([[b"a", b"b"], [b"a", b"b"], [b"c", b"d"]], dtype=object)
)
train_labels = [1,1,-1]
test_data = FeaturesData(
    num_feature_data=np.array([[2, 4, 6, 8], [1, 4, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([[b"a", b"b"], [b"a", b"d"]], dtype=object)
)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='Logloss')
# Fit model
model.fit(train_data, train_labels)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')

CatBoostClassifier with Pool and FeaturesData

import numpy as np
from catboost import CatBoostClassifier, FeaturesData, Pool
# Initialize data
train_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]], dtype=np.float32),
        cat_feature_data=np.array([[b"a", b"b"], [b"a", b"b"], [b"c", b"d"]], dtype=object)
    ),
    label=[1, 1, -1]
)
test_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[2, 4, 6, 8], [1, 4, 50, 60]], dtype=np.float32),
        cat_feature_data=np.array([[b"a", b"b"], [b"a", b"d"]], dtype=object)
    )
)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='Logloss')
# Fit model
model.fit(train_data)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')