catboost.load_pool

catboost.load_pool(data, 
                   label = NULL, 
                   cat_features = NULL, 
                   column_description = NULL, 
                   pairs = NULL, 
                   delimiter = "\t", 
                   has_header = FALSE, 
                   weight = NULL, 
                   group_id = NULL,
                   group_weight = NULL,
                   subgroup_id = NULL,
                   pairs_weight = NULL, 
                   baseline = NULL, 
                   feature_names = NULL, 
                   thread_count = -1)

Purpose

Load the CatBoost dataset.

Arguments

Argument | Description | Default value
data

A file path, data.frame or matrix with features.

The following column types are supported:
  • double
  • factor. It is assumed that categorical features are given in this type of columns. A standard CatBoost processing procedure is applied to this type of columns:
    1. The values are converted to strings.
    2. The ConvertCatFeatureToFloat function is applied to the resulting string.
Required argument
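The supported column types can be illustrated with a small sketch. The toy data, the column names, and the `requireNamespace` guard are illustrative assumptions rather than part of the CatBoost documentation; the pool is only built when the catboost package is installed.

```r
# Toy feature table (hypothetical data): one double column and one factor column.
features <- data.frame(
  age = c(25, 38, 52, 41),
  workclass = factor(c("private", "gov", "private", "self"))
)
label <- c(0, 1, 1, 0)

# Inspect the column types before loading:
# double ("numeric") columns are numerical features, factor columns are categorical.
column_types <- vapply(features, function(col) class(col)[1], character(1))
print(column_types)

# Build the pool only if the catboost package is available.
if (requireNamespace("catboost", quietly = TRUE)) {
  pool <- catboost::catboost.load_pool(data = features, label = label)
}
```

Checking the column classes up front is a cheap way to catch character columns that were meant to be factors.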
label

The target variables (in other words, the objects' label values) for the training dataset.

This parameter is used if the input data format is matrix or data.frame. Otherwise it must be set to NULL.

NULL
cat_features

A vector of categorical feature indices.

The indices are zero-based and can differ from the ones given in the Column descriptions file.

NULL (it is assumed that all columns are the values of numerical features)
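Because R column positions are one-based while cat_features is zero-based, an off-by-one error is easy to make. The following is a minimal sketch with made-up data; the matrix contents and variable names are hypothetical, and the pool is only built when the catboost package is installed.

```r
# Toy numeric matrix (hypothetical data); the second column holds category codes.
feature_matrix <- cbind(
  age = c(25, 38, 52, 41),
  workclass_code = c(1, 2, 1, 3),
  hours = c(40, 50, 38, 45)
)
label <- c(0, 1, 1, 0)

# R column positions are one-based; cat_features expects zero-based indices.
cat_columns <- c(2)
cat_features <- cat_columns - 1

# Build the pool only if the catboost package is available.
if (requireNamespace("catboost", quietly = TRUE)) {
  pool <- catboost::catboost.load_pool(data = feature_matrix,
                                       label = label,
                                       cat_features = cat_features)
}
```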
column_description

The path to the input file that contains the column descriptions.

This parameter is used if the data is input from a file.

NULL (it is assumed that the first column in the dataset file defines the label value, and the other columns are the values of numerical features)
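As a rough sketch of loading from a file with a column description, the snippet below writes a tiny dataset and a matching description to temporary files. It assumes the tab-separated column description layout (zero-based column index, then a type such as `Label` or `Categ`, with unlisted columns treated as numerical); the data and file names are made up, and the pool is only built when the catboost package is installed.

```r
# Write a tiny tab-separated dataset (hypothetical data) to a temporary file.
data_path <- tempfile(fileext = ".tsv")
writeLines(c("1\t25\tprivate",
             "0\t38\tgov",
             "1\t52\tprivate",
             "0\t41\tself"), data_path)

# Column descriptions: zero-based column index, a tab, then the column type.
# Assumption: unlisted columns (here column 1) are treated as numerical.
cd_path <- tempfile(fileext = ".cd")
writeLines(c("0\tLabel",
             "2\tCateg"), cd_path)

# Build the pool only if the catboost package is available.
if (requireNamespace("catboost", quietly = TRUE)) {
  pool <- catboost::catboost.load_pool(data_path,
                                       column_description = cd_path)
}
```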
pairs

A file path, matrix or data.frame with pair descriptions of shape N by 2:

  • N is the number of pairs.
  • The first element of the pair is the zero-based index of the winner object from the input dataset for pairwise comparison.
  • The second element of the pair is the zero-based index of the loser object from the input dataset for pairwise comparison.

This information is used for calculation and optimization of Pairwise metrics.

NULL
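The N-by-2 layout described for pairs can be sketched in base R as follows. The row numbers and the object count are hypothetical; the resulting matrix is the kind of value the pairs argument is described as accepting, but the snippet itself does not call CatBoost.

```r
# Hypothetical preferences between four objects, given as one-based R row numbers.
winner_rows <- c(1, 3, 4)
loser_rows <- c(2, 2, 1)

# Convert to zero-based indices and bind into an N-by-2 matrix:
# first column = winner, second column = loser.
pairs <- cbind(winner = winner_rows - 1, loser = loser_rows - 1)

# Sanity checks: two columns, and every index refers to an existing object.
n_objects <- 4
stopifnot(ncol(pairs) == 2, all(pairs >= 0), all(pairs < n_objects))
print(pairs)
```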
delimiter

The delimiter character used to separate the data in the dataset description input file.

Only single-character delimiters are supported. If the specified value contains more than one character, only the first one is used.

\t
has_header

Read the column names from the first line if this parameter is set to TRUE.

FALSE
weight

The weights of the label values vector.

NULL
group_id

Group identifiers for all input objects.

Attention.

All objects in the dataset must be grouped by group identifiers if they are present. I.e., the objects with the same group identifier should follow each other in the dataset.

Example

For example, assume that the dataset consists of several documents, each assigned to a group. All lines that belong to the same group must be adjacent in the dataset.

The grouped blocks of lines can be input in any order.

NULL
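The grouping requirement for group_id can be met by reordering the data before loading. The sketch below uses made-up ranking data and base R only; the column names are hypothetical, and it does not call CatBoost.

```r
# Hypothetical ranking data whose rows are not yet grouped by group identifier.
docs <- data.frame(
  group = c(2L, 1L, 2L, 3L, 1L),
  feature = c(0.4, 0.1, 0.5, 0.9, 0.3),
  label = c(0, 1, 1, 1, 0)
)

# Reorder so that rows sharing a group identifier are adjacent.
docs_sorted <- docs[order(docs$group), ]

# Verify contiguity: after run-length encoding,
# each identifier must appear in exactly one run.
runs <- rle(docs_sorted$group)
stopifnot(anyDuplicated(runs$values) == 0)
print(docs_sorted$group)
```

After this step, the feature columns, `docs_sorted$label`, and `docs_sorted$group` are in a consistent order and could be supplied as data, label, and group_id respectively.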
group_weight

The weights of all objects within the defined groups from the input data in the form of one-dimensional array-like data.

Used for calculating the final values of trees. By default, it is set to 1 for all objects in all groups.

Restriction.

Only one of the following parameters can be used at a time:

  • group_id
  • group_weight
NULL
subgroup_id

Subgroup identifiers for all input objects.

NULL
pairs_weight

The weight of each input pair of objects.

This information is used for calculation and optimization of Pairwise metrics.

By default, it is set to 1 for all pairs.

Do not use this parameter if an input file is specified in the pairs parameter.

NULL
baseline

A vector of formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.

NULL
feature_names

A list of names for each feature in the dataset.

NULL
thread_count

The number of threads to use while reading the data.

Optimizes the reading time. This parameter doesn't affect the results.

-1 (the number of threads is equal to the number of processor cores)

Examples

Load the dataset and the column descriptions from the adult_train.1000 and adult.cd files, respectively (both shipped with the catboost package):
library(catboost)

pool_path = system.file("extdata", 
                        "adult_train.1000", 
                        package = "catboost")
column_description_path = system.file("extdata", 
                                      "adult.cd", 
                                      package = "catboost")
pool <- catboost.load_pool(pool_path, 
                           column_description = column_description_path)
head(pool, 1)[[1]]
Load the dataset from a matrix (the dataset ships with the CatBoost R package and is a subset of the Adult Data Set distributed through the UCI Machine Learning Repository):
library(catboost)

pool_path = system.file("extdata", 
                        "adult_train.1000", 
                        package="catboost")

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'

data <- read.table(pool_path, 
                   head = F, 
                   sep = "\t", 
                   colClasses = column_description_vector, 
                   na.strings='NAN')

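# Transform categorical features to numerical
for (i in cat_features)
  data[,i] <- as.numeric(factor(data[,i]))

# The label is stored in the first column of the dataset
target <- c(1)
# Column i of data becomes zero-based feature index i - 2
# once the label column is dropped
pool <- catboost.load_pool(as.matrix(data[,-target]),
                           label = as.matrix(data[,target]),
                           cat_features = cat_features - 2)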
head(pool, 1)[[1]]
Load the dataset from a data.frame:
library(catboost)

train_path = system.file("extdata", 
                         "adult_train.1000", 
                         package="catboost")
test_path = system.file("extdata", 
                        "adult_test.1000", 
                        package="catboost")

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'

train <- read.table(train_path, 
                    head = F, 
                    sep = "\t", 
                    colClasses = column_description_vector, 
                    na.strings='NAN')
test <- read.table(test_path, 
                   head = F, 
                   sep = "\t", 
                   colClasses = column_description_vector, 
                   na.strings='NAN')
target <- c(1)
train_pool <- catboost.load_pool(data=train[,-target], 
                                 label = train[,target])
test_pool <- catboost.load_pool(data=test[,-target], 
                                label = test[,target])
head(train_pool, 1)[[1]]
head(test_pool, 1)[[1]]