fit

Train a model.

Note. To train the model on GPU, set the task_type parameter to GPU in the class constructor. Training on GPU requires an NVIDIA driver of version 390.xx or higher.
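For illustration, a minimal sketch of GPU training; the dataset below is arbitrary, and a CUDA-capable GPU with a suitable driver is assumed to be available.

from catboost import CatBoostClassifier

# Toy dataset: four objects, two numerical features.
train_data = [[1, 4], [4, 5], [30, 40], [20, 30]]
train_labels = [1, 1, 0, 0]

# task_type='GPU' in the constructor switches training to the GPU.
model = CatBoostClassifier(iterations=10, task_type='GPU')
model.fit(train_data, train_labels)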

Method call format

fit(X, 
    y=None,
    cat_features=None,
    sample_weight=None,
    baseline=None,
    use_best_model=None,
    eval_set=None,
    verbose=None, 
    logging_level=None,
    plot=False,
    column_description=None,
    verbose_eval=None, 
    metric_period=None, 
    silent=None, 
    early_stopping_rounds=None,
    save_snapshot=None, 
    snapshot_file=None, 
    snapshot_interval=None)

Parameters

Some parameters duplicate the ones specified in the constructor of the CatBoostClassifier class. In these cases the values specified for the fit method take precedence. The rest of the training parameters must be set in the constructor of the CatBoostClassifier class.

For each parameter, the possible types, description, default value, and supported processing units are listed below.
X
  • catboost.Pool
  • two-dimensional feature matrix

The input training dataset in the form of a catboost.Pool object or a two-dimensional feature matrix.

Required parameter

CPU and GPU

y
  • list
  • numpy.array
  • pandas.DataFrame
  • pandas.Series

The target variables (in other words, the objects' label values) for the training dataset.

Must be in the form of a one-dimensional array of floating point numbers.

Note. Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.
None

CPU and GPU

cat_features
  • list
  • numpy.array

A one-dimensional array of categorical column indices (see the example below).

If a catboost.Pool object is used for training, its categorical features must match those specified for the model.

Note. Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.
None (all features are considered numerical)

CPU and GPU
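For illustration, a minimal sketch of passing cat_features for a plain feature matrix; the data and indices below are arbitrary.

from catboost import CatBoostClassifier

# Columns 0 and 1 hold categorical values, column 2 is numerical.
train_data = [["summer", "USA", 1.5],
              ["winter", "UK", 0.3],
              ["summer", "UK", 2.1],
              ["winter", "USA", 0.8]]
train_labels = [1, 0, 1, 0]

model = CatBoostClassifier(iterations=10)
# cat_features lists the indices of the categorical columns in train_data.
model.fit(train_data, train_labels, cat_features=[0, 1])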

sample_weight
  • list
  • numpy.array
  • pandas.DataFrame
  • pandas.Series

The weight of each object in the input data, in the form of one-dimensional array-like data (see the example below).

By default, it is set to 1 for all objects.

None

CPU and GPU
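For illustration, a minimal sketch of per-object weights; the weight values below are arbitrary.

from catboost import CatBoostClassifier

train_data = [[1, 4], [4, 5], [30, 40], [20, 30]]
train_labels = [1, 1, 0, 0]
# One weight per object; objects with larger weights contribute more to the loss.
weights = [0.5, 1.0, 1.0, 2.0]

model = CatBoostClassifier(iterations=10)
model.fit(train_data, train_labels, sample_weight=weights)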

baseline
  • list
  • numpy.array

Array of formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.

Note. Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.
None

CPU and GPU

use_best_model
  • bool

If this parameter is set, the number of trees that are saved in the resulting model is defined as follows:
  1. Build the number of trees defined by the training parameters.
  2. Use the validation dataset to identify the iteration with the optimal value of the metric specified in --eval-metric (eval_metric).

No trees are saved after this iteration.

This option requires a validation dataset to be provided.

True if a validation set is input (the eval_set parameter is defined) and at least one of the label values of objects in this set differs from the others. False otherwise.

CPU

eval_set
  • catboost.Pool
  • list of catboost.Pool
  • list of (X, y) tuples
The validation dataset or datasets used during training, for example, by the overfitting detector and for identifying the iteration with the best metric value (see the example below).
None

CPU and GPU

Note. Only a single validation dataset can be input if the training is performed on GPU.
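For illustration, a minimal sketch of training with a validation dataset and use_best_model; the data below is arbitrary.

from catboost import CatBoostClassifier, Pool

train_pool = Pool(data=[[1, 4], [4, 5], [30, 40], [20, 30]], label=[1, 1, 0, 0])
eval_pool = Pool(data=[[2, 4], [31, 41]], label=[1, 0])

# The validation dataset is used to pick the iteration with the best metric value.
model = CatBoostClassifier(iterations=50, use_best_model=True)
model.fit(train_pool, eval_set=eval_pool)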

verbose

Alias: verbose_eval

  • bool
  • int

The purpose of this parameter depends on the type of the given value:

  • bool — Defines the logging level:
    • True corresponds to the Verbose logging level
    • False corresponds to the Silent logging level
  • int — Use the Verbose logging level and set the logging period to the value of this parameter (see the example below).
Restriction. Do not use this parameter with the logging_level parameter.
1

CPU and GPU
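For example, a sketch that keeps the Verbose logging level but prints output only every 10 iterations; the data is arbitrary.

from catboost import CatBoostClassifier

train_data = [[1, 4], [4, 5], [30, 40], [20, 30]]
train_labels = [1, 1, 0, 0]

model = CatBoostClassifier(iterations=100)
# An integer value sets the logging period: metric output is printed every 10 iterations.
model.fit(train_data, train_labels, verbose=10)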

logging_level
  • string

The logging level to output to stdout.

Possible values:
  • Silent — Do not output any logging information to stdout.

  • Verbose — Output the following data to stdout:

    • optimized metric
    • elapsed time of training
    • remaining time of training
  • Info — Output additional information and the number of trees.

  • Debug — Output debugging information.
None (corresponds to the Verbose logging level)

CPU and GPU

plot
  • bool
Plot the following information during training:
  • the metric values;
  • the custom loss values;
  • the time that has passed since training started;
  • the remaining time until the end of training.
This option can be used if training is performed in a Jupyter Notebook.
False

CPU

column_description
  • string

The path to the input file that contains the column descriptions.

The given file is used to build pools from the train and/or validation datasets, which are input from files.

None

CPU and GPU

silent
  • bool

Defines the logging level:
  • True — corresponds to the Silent logging level
  • False — corresponds to the Verbose logging level
False

CPU and GPU

early_stopping_rounds
  • int

Set the overfitting detector type to Iter and stop the training after the specified number of iterations since the iteration with the optimal metric value (see the example below).

False

CPU and GPU
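For illustration, a minimal sketch of early stopping against a validation dataset; the data and the number of rounds below are arbitrary.

from catboost import CatBoostClassifier, Pool

train_pool = Pool(data=[[1, 4], [4, 5], [30, 40], [20, 30]], label=[1, 1, 0, 0])
eval_pool = Pool(data=[[2, 4], [31, 41]], label=[1, 0])

model = CatBoostClassifier(iterations=200)
# Training stops if the metric on eval_set has not improved for 20 consecutive iterations.
model.fit(train_pool, eval_set=eval_pool, early_stopping_rounds=20)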

save_snapshot
  • bool

Enable snapshotting for restoring the training progress after an interruption.

None

CPU and GPU

snapshot_file
  • string

The name of the file to save the training progress information in. This file is used for recovering training after an interruption.

Depending on whether the specified file exists in the file system:
  • Missing — Write information about training progress to the specified file.
  • Exists — Load data from the specified file and continue training from where it left off.

experiment...

CPU and GPU

snapshot_interval
  • int

The interval between saving snapshots in seconds.

The first snapshot is taken after the specified number of seconds since the start of training. Every subsequent snapshot is taken after the specified number of seconds since the previous one. The last snapshot is taken at the end of the training.

600

CPU and GPU
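For illustration, a minimal sketch of snapshotting; the file name and interval below are arbitrary examples, not defaults.

from catboost import CatBoostClassifier

train_data = [[1, 4], [4, 5], [30, 40], [20, 30]]
train_labels = [1, 1, 0, 0]

model = CatBoostClassifier(iterations=500)
# Progress is written to 'training_snapshot' every 60 seconds; rerunning the same
# script after an interruption resumes training from the last saved snapshot.
model.fit(train_data, train_labels,
          save_snapshot=True,
          snapshot_file='training_snapshot',
          snapshot_interval=60)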

Usage examples

CatBoostClassifier with FeaturesData
import numpy as np
from catboost import CatBoostClassifier, FeaturesData
# Initialize data
train_data = FeaturesData(
    num_feature_data=np.array([[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([[b"a", b"b"], [b"a", b"b"], [b"c", b"d"]], dtype=object)
)
train_labels = [1,1,-1]
test_data = FeaturesData(
    num_feature_data=np.array([[2, 4, 6, 8], [1, 4, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([[b"a", b"b"], [b"a", b"d"]], dtype=object)
)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='Logloss')
# Fit model
model.fit(train_data, train_labels)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')
CatBoostClassifier with Pool and FeaturesData
import numpy as np
from catboost import CatBoostClassifier, FeaturesData, Pool
# Initialize data
train_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]], dtype=np.float32),
        cat_feature_data=np.array([[b"a", b"b"], [b"a", b"b"], [b"c", b"d"]], dtype=object)
    ),
    label=[1, 1, -1]
)
test_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[2, 4, 6, 8], [1, 4, 50, 60]], dtype=np.float32),
        cat_feature_data=np.array([[b"a", b"b"], [b"a", b"d"]], dtype=object)
    )
)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='Logloss')
# Fit model
model.fit(train_data)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')