fit

Train a model.

Note. Set the task_type parameter in the class constructor to GPU to train the model on GPU. Training on GPU requires an NVIDIA driver of version 390.xx or higher.

Method call format

fit(X,
    y=None,
    cat_features=None,
    pairs=None,
    sample_weight=None,
    group_id=None,
    group_weight=None,
    subgroup_id=None,
    pairs_weight=None,
    baseline=None,
    use_best_model=None,
    eval_set=None,
    verbose=None,
    logging_level=None,
    plot=False,
    column_description=None,
    verbose_eval=None,
    metric_period=None,
    silent=None,
    early_stopping_rounds=None,
    save_snapshot=None,
    snapshot_file=None,
    snapshot_interval=None)

Parameters

Some parameters duplicate the ones specified in the constructor of the CatBoost class. In these cases the values specified for the fit method take precedence. The rest of the training parameters must be set in the constructor of the CatBoost class.
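As a rough illustration of the precedence rule above, the following sketch sets verbose in the constructor and then overrides it in the fit call. The toy data, the train_toy_model helper, and the choice of CatBoostClassifier with iterations=10 are assumptions made for this example only. The import is guarded so that the snippet still runs where catboost is not installed.

```python
# Toy data: two numerical features per object (illustrative values only).
train_X = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0], [4.0, 7.0], [5.0, 8.0], [6.0, 9.0]]
train_y = [0, 0, 0, 1, 1, 1]
valid_X = [[1.5, 4.5], [5.5, 8.5]]
valid_y = [0, 1]


def train_toy_model():
    """Fit a small classifier; return None when catboost is not installed."""
    try:
        from catboost import CatBoostClassifier
    except ImportError:
        return None
    # Constructor-level settings, including a verbose default.
    model = CatBoostClassifier(iterations=10, verbose=True, allow_writing_files=False)
    # The verbose value passed to fit takes precedence over the constructor value.
    model.fit(train_X, train_y, eval_set=[(valid_X, valid_y)], verbose=False)
    return model


model = train_toy_model()
```

Parameters that fit does not accept, such as iterations here, have to be set in the constructor.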

Each parameter is described by its possible types, description, default value, and supported processing units.

X
  • catboost.Pool

The input training dataset in the form of a pool object or in the form of a two-dimensional feature matrix.

Required parameter

CPU and GPU

y
  • list
  • numpy.array
  • pandas.DataFrame
  • pandas.Series

The target variables (in other words, the objects' label values) for the training dataset.

Must be in the form of a one-dimensional array of floating point numbers.

Note. Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.
None

CPU and GPU

cat_features
  • list
  • numpy.array

A one-dimensional array of categorical column indices.

Categorical features of the catboost.Pool object must be equal to those of the model if a catboost.Pool object is used for training.

Note. Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.
None (all features are considered numerical)

CPU and GPU
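A minimal sketch of what a cat_features value looks like for a small feature matrix. The infer_cat_feature_indices helper is hypothetical and not part of CatBoost. It assumes that every column holding strings should be treated as categorical.

```python
def infer_cat_feature_indices(rows):
    """Return the zero-based indices of columns whose first value is a string."""
    first_row = rows[0]
    return [index for index, value in enumerate(first_row) if isinstance(value, str)]


# Column 0 is categorical (color), column 1 is numerical.
rows = [["red", 1.0], ["blue", 2.0], ["green", 3.0]]
cat_features = infer_cat_feature_indices(rows)

# The resulting list could then be passed as fit(rows, labels, cat_features=cat_features).
```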

pairs
  • list
  • numpy.array
  • pandas.DataFrame

The pairs description in the form of a two-dimensional matrix of shape N by 2:

  • N is the number of pairs.
  • The first element of the pair is the zero-based index of the winner object from the input dataset for pairwise comparison.
  • The second element of the pair is the zero-based index of the loser object from the input dataset for pairwise comparison.

This information is used for calculation and optimization of Pairwise metrics.

None

CPU
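The pairs layout described above can be sketched in plain Python. The make_pairs helper and the relevance scores are hypothetical. The sketch assumes that all objects belong to one group and that an object with a higher score wins against an object with a lower score.

```python
def make_pairs(relevance):
    """Build an N-by-2 list of [winner_index, loser_index] pairs."""
    pairs = []
    for winner, winner_score in enumerate(relevance):
        for loser, loser_score in enumerate(relevance):
            if winner_score > loser_score:
                pairs.append([winner, loser])
    return pairs


relevance = [3, 1, 2]           # illustrative scores for objects 0, 1, 2
pairs = make_pairs(relevance)   # one row per pair, zero-based indices
pairs_weight = [1.0] * len(pairs)  # one weight per pair, matching the default of 1
```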

sample_weight
  • list
  • numpy.array
  • pandas.DataFrame
  • pandas.Series

The weight of each object in the input data in the form of one-dimensional array-like data.

By default, it is set to 1 for all objects.

None

CPU and GPU

group_id
  • list
  • numpy.array
Group identifiers for all input objects. Supported identifier types are:
  • int
  • string types (string or unicode for Python 2 and bytes or string for Python 3).

None

CPU

group_weight
  • list
  • numpy.array

The weights of all objects within the defined groups from the input data in the form of one-dimensional array-like data.

Used for calculating the final values of trees. By default, it is set to 1 for all objects in all groups.

Restriction.

Only one of the following parameters can be used at a time:

  • sample_weight
  • group_weight
None

CPU

subgroup_id
  • list
  • numpy.array
Subgroup identifiers for all input objects. Supported identifier types are:
  • int
  • string types (string or unicode for Python 2 and bytes or string for Python 3).
None

CPU

pairs_weight
  • list
  • numpy.array

The weight of each input pair of objects in the form of one-dimensional array-like data.

This information is used for calculation and optimization of Pairwise metrics.

By default, it is set to 1 for all pairs.

None

CPU

baseline
  • list
  • numpy.array

Array of formula values for all input objects. The training starts from these values for all input objects instead of starting from zero.

Note. Do not use this parameter if the input training dataset (specified in the X parameter) type is catboost.Pool.
None

CPU and GPU

use_best_model
  • bool

If this parameter is set, the number of trees that are saved in the resulting model is defined as follows:
  1. Build the number of trees defined by the training parameters.
  2. Use the validation dataset to identify the iteration with the optimal value of the metric specified in the eval_metric parameter.

No trees are saved after this iteration.

This option requires a validation dataset to be provided.

True if a validation set is input (the eval_set parameter is defined) and at least one of the label values of objects in this set differs from the others. False otherwise.

CPU
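The selection rule for use_best_model can be sketched in plain Python. The best_tree_count helper and the metric values are hypothetical, and the sketch only mirrors the two steps described above rather than the library's internals. It assumes one metric value per iteration, measured on the validation dataset.

```python
def best_tree_count(eval_metric_values, lower_is_better=True):
    """Return how many trees to keep: all trees up to the best iteration."""
    pick = min if lower_is_better else max
    best_iteration = pick(
        range(len(eval_metric_values)), key=eval_metric_values.__getitem__
    )
    # Iterations are zero-based, so the tree count is the index plus one.
    return best_iteration + 1


# Validation loss per iteration (illustrative values); the minimum is at index 2.
validation_loss = [0.9, 0.7, 0.6, 0.65, 0.7]
trees_to_keep = best_tree_count(validation_loss)
```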

eval_set
  • catboost.Pool
  • list of catboost.Pool
  • list of (X, y) tuples
The validation dataset or datasets used for processes such as overfitting detection and best iteration selection.
None

CPU and GPU

Note. Only a single validation dataset can be input if the training is performed on GPU.

verbose

Alias: verbose_eval

  • bool
  • int

The purpose of this parameter depends on the type of the given value:

  • bool — Defines the logging level:
    • “True” corresponds to the Verbose logging level
    • “False” corresponds to the Silent logging level
  • int — Use the Verbose logging level and set the logging period to the value of this parameter.
Restriction. Do not use this parameter with the logging_level parameter.
1

CPU and GPU
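The type-dependent meaning of verbose can be summarized with a small sketch. The resolve_verbose helper is hypothetical and only restates the description above. It is not how the library resolves the value internally.

```python
def resolve_verbose(verbose):
    """Map a verbose value to (logging level, logging period or None)."""
    # bool must be checked first because bool is a subclass of int in Python.
    if isinstance(verbose, bool):
        return ("Verbose" if verbose else "Silent", None)
    if isinstance(verbose, int):
        # An int selects the Verbose level and sets the logging period.
        return ("Verbose", verbose)
    raise TypeError("verbose must be a bool or an int")


quiet = resolve_verbose(False)
every_50 = resolve_verbose(50)
```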

logging_level
  • string

The logging level to output to stdout.

Possible values:
  • Silent — Do not output any logging information to stdout.

  • Verbose — Output the following data to stdout:

    • optimized metric
    • elapsed time of training
    • remaining time of training
  • Info — Output additional information and the number of trees.

  • Debug — Output debugging information.
None (corresponds to the Verbose logging level)

CPU and GPU

plot
  • bool

Plot the following information during training:
  • the metric values;
  • the custom loss values;
  • the time elapsed since training started;
  • the remaining time until the end of training.
This option can be used if training is performed in a Jupyter notebook.
False

CPU

column_description
  • string

The path to the input file that contains the column descriptions.

The given file is used to build pools from the train and/or validation datasets, which are input from files.

None

CPU and GPU

metric_period
  • int

The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer.

Setting this parameter to a value greater than 1 speeds up the training, because objectives and metrics are calculated less often.

Note.

It is recommended to increase the value of this parameter to maintain training speed if a GPU processing unit type is used.

1

CPU and GPU

silent
  • bool

Defines the logging level:
  • “True” — corresponds to the Silent logging level
  • “False” — corresponds to the Verbose logging level
False

CPU and GPU

early_stopping_rounds
  • int

Set the overfitting detector type to Iter and stop the training after the specified number of iterations since the iteration with the optimal metric value.

False

CPU and GPU
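The stopping rule described for early_stopping_rounds can be sketched as follows. The stopping_iteration helper and the metric values are hypothetical. This is an illustration of the rule as stated above, not the library's overfitting detector.

```python
def stopping_iteration(metric_values, early_stopping_rounds, lower_is_better=True):
    """Return the zero-based iteration at which training would stop."""
    best_value = None
    best_iteration = -1
    for iteration, value in enumerate(metric_values):
        if best_value is None:
            improved = True
        elif lower_is_better:
            improved = value < best_value
        else:
            improved = value > best_value
        if improved:
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= early_stopping_rounds:
            # The specified number of iterations passed without improvement.
            return iteration
    # The detector never fired, so training runs to the last iteration.
    return len(metric_values) - 1


# Validation loss per iteration (illustrative values); the best so far is at index 2.
validation_loss = [0.9, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]
stopped_at = stopping_iteration(validation_loss, early_stopping_rounds=3)
```

In this toy run the later improvement at index 6 is never reached, which is the usual trade-off of a small early_stopping_rounds value.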

save_snapshot
  • bool

Enable snapshotting for restoring the training progress after an interruption.

None

CPU and GPU

snapshot_file
  • string

The name of the file to save the training progress information in. This file is used for recovering training after an interruption.

Depending on whether the specified file exists in the file system:
  • Missing — Write information about training progress to the specified file.
  • Exists — Load data from the specified file and continue training from where it left off.

experiment...

CPU and GPU
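The missing or exists behavior of snapshot_file can be imitated with a toy loop. Everything in this sketch is hypothetical: the train_with_snapshot helper, the interrupt_after argument, and the JSON file layout are stand-ins chosen for illustration and do not reflect CatBoost's snapshot format.

```python
import json
import os
import tempfile


def train_with_snapshot(snapshot_path, total_iterations, interrupt_after=None):
    """Run a toy training loop that resumes from snapshot_path if it exists."""
    if os.path.exists(snapshot_path):
        # Exists: load the saved progress and continue from there.
        with open(snapshot_path) as snapshot:
            done = json.load(snapshot)["iterations_done"]
    else:
        # Missing: start from scratch.
        done = 0
    target = total_iterations
    if interrupt_after is not None:
        # Simulate an interruption after a few iterations.
        target = min(total_iterations, done + interrupt_after)
    while done < target:
        done += 1  # stand-in for building one tree
    with open(snapshot_path, "w") as snapshot:
        json.dump({"iterations_done": done}, snapshot)
    return done


with tempfile.TemporaryDirectory() as work_dir:
    snapshot_path = os.path.join(work_dir, "toy_snapshot.json")
    first_run = train_with_snapshot(snapshot_path, total_iterations=10, interrupt_after=4)
    second_run = train_with_snapshot(snapshot_path, total_iterations=10)
```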

snapshot_interval
  • int

The interval between saving snapshots in seconds.

The first snapshot is taken after the specified number of seconds since the start of training. Every subsequent snapshot is taken after the specified number of seconds since the previous one. The last snapshot is taken at the end of the training.

600

CPU and GPU
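The snapshot schedule described above can be worked out with a short sketch. The snapshot_times helper is hypothetical and assumes that the training duration is known in advance and given in whole seconds.

```python
def snapshot_times(training_seconds, snapshot_interval=600):
    """Return the times, in seconds since the start, at which snapshots are taken."""
    # One snapshot per full interval that ends before training finishes.
    times = list(range(snapshot_interval, training_seconds, snapshot_interval))
    # The last snapshot is taken at the end of the training.
    times.append(training_seconds)
    return times


# A 25-minute run with the default 600-second interval.
schedule = snapshot_times(1500)
```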