Pool

class Pool(data, 
           label=None,
           cat_features=None,
           column_description=None,
           pairs=None,
           delimiter='\t',
           has_header=False,
           weight=None, 
           group_id=None,
           group_weight=None,
           subgroup_id=None,
           pairs_weight=None,
           baseline=None,
           feature_names=None,
           thread_count=-1)

Purpose

Dataset processing.

If most or all of your features are numerical, the fastest way to pass feature data to the Pool constructor (and to other CatBoost, CatBoostClassifier, and CatBoostRegressor methods that accept it) is the FeaturesData class. For datasets that contain only numerical features, passing the feature data as a numpy.ndarray with the numpy.float32 dtype gives similar performance.

Parameters

Parameter | Possible types | Description | Default value
data
  • list
  • numpy.array
  • pandas.DataFrame
  • pandas.Series

Dataset in the form of a two-dimensional feature matrix.

Required parameter
catboost.FeaturesData

Dataset in the form of catboost.FeaturesData. The fastest way to create a Pool from Python objects.

string

The path to the input file that contains the dataset description.

label
  • list
  • numpy.array
  • pandas.Series
  • pandas.DataFrame

The target variables (in other words, the objects' label values) for the training dataset.

Must be in the form of a one-dimensional array of floating point numbers.

None
cat_features
  • list
  • numpy.array

A one-dimensional array of categorical column indices.

Categorical features of the catboost.Pool object must be equal to those of the model if a catboost.Pool object is used for training.

Note. Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.
None (all columns are assumed to contain numerical feature values)
column_description
string

The path to the input file that contains the column descriptions.

None
pairs
  • list
  • numpy.array
  • pandas.DataFrame

The pairs description in the form of a two-dimensional matrix of shape N by 2:

  • N is the number of pairs.
  • The first element of the pair is the zero-based index of the winner object from the input dataset for pairwise comparison.
  • The second element of the pair is the zero-based index of the loser object from the input dataset for pairwise comparison.

This information is used for calculation and optimization of Pairwise metrics.

None

string

The path to the input file that contains the pair descriptions.

This information is used for calculation and optimization of Pairwise metrics.

delimiter
string

The delimiter character used to separate the data in the dataset description input file.

Only single-character delimiters are supported. If the specified value contains more than one character, only the first one is used.

"\t"
has_header
bool

Read the column names from the first line if this parameter is set to True.

False
weight
  • list
  • numpy.array

The weight of each object in the input data, in the form of one-dimensional array-like data.

By default, it is set to 1 for all objects.

Restriction.

Only one of the following parameters can be used at a time:

  • weight
  • group_weight
None
group_weight
  • list
  • numpy.array

The weights of all objects within the defined groups from the input data in the form of one-dimensional array-like data.

Used for calculating the final values of trees. By default, it is set to 1 for all objects in all groups.

Restriction.

Only one of the following parameters can be used at a time:

  • weight
  • group_weight
None
group_id
  • list
  • numpy.array
Group identifiers for all input objects. Supported identifier types are:
  • int
  • string types (string or unicode for Python 2 and bytes or string for Python 3).
Attention.

All objects in the dataset must be grouped by group identifiers if they are present. I.e., the objects with the same group identifier should follow each other in the dataset.

Example

For example, let's assume that the dataset consists of documents . The corresponding groups are , respectively. The feature vectors for the given documents are respectively. Then the dataset can take the following form:

The grouped blocks of lines can be input in any order. For example, the following order is equivalent to the previous one:

None
subgroup_id
  • list
  • numpy.array
Subgroup identifiers for all input objects. Supported identifier types are:
  • int
  • string types (string or unicode for Python 2 and bytes or string for Python 3).
None
pairs_weight
  • list
  • numpy.array

The weight of each input pair of objects, in the form of one-dimensional array-like data.

This information is used for calculation and optimization of Pairwise metrics.

By default, it is set to 1 for all pairs.

None
baseline
  • list
  • numpy.array

Array of formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.

None
feature_names
list

A list of names for each feature in the dataset.

None
thread_count
int

The number of threads to use when reading data from file.

Use only when the dataset is read from an input file.

-1 (the number of threads is equal to the number of processor cores)

Attributes

Attribute | Description
shape

Return the shape of the dataset.

is_empty_

Indicates that an empty array was input.

Methods

Method | Description
num_row

Return the number of objects contained in the dataset.

num_col

Return the number of columns that contain feature data.

get_features

Return an array of the dataset features.

get_label

Return the value of the label assigned to the input data.

get_cat_feature_indices

Return the indices of categorical features found in the input data.

get_weight

Return the list of weights for each object of the dataset.

get_baseline

Return an array of baselines from the dataset.

set_pairs

Set the list of pairs for Pairwise metrics.

set_feature_names

Set names for all features in the dataset.

set_baseline

Set initial formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.

set_weight

Set weights for all input objects.

set_group_id

Set group identifiers for all input objects.

set_group_weight

Set weights for all objects within the defined group.

set_subgroup_id

Set subgroup identifiers for all input objects.

set_pairs_weight

Set weights for each pair of objects.

slice

Form a slice of the input dataset from the given list of object indices.

Usage examples

CatBoostClassifier with Pool and FeaturesData
import numpy as np
from catboost import CatBoostClassifier, FeaturesData, Pool
# Initialize data
train_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]], dtype=np.float32),
        cat_feature_data=np.array([[b"a", b"b"], [b"a", b"b"], [b"c", b"d"]], dtype=object)
    ),
    label=[1, 1, -1]
)
test_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[2, 4, 6, 8], [1, 4, 50, 60]], dtype=np.float32),
        cat_feature_data=np.array([[b"a", b"b"], [b"a", b"d"]], dtype=object)
    )
)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='Logloss')
# Fit model
model.fit(train_data)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')