cv

cv(pool=None, 
   params=None, 
   dtrain=None, 
   iterations=None, 
   num_boost_round=None,
   fold_count=3, 
   nfold=None,
   inverted=False,
   partition_random_seed=0,
   seed=None, 
   shuffle=True, 
   logging_level=None, 
   stratified=False,
   as_pandas=True,
   metric_period=None,
   verbose=None,
   verbose_eval=None,
   plot=False,
   early_stopping_rounds=None)

Purpose

Perform cross-validation on the dataset.

The dataset is split into N folds. Each model is trained on N–1 folds and evaluated on the remaining fold, so that every fold serves as the evaluation set once. For each training iteration, the average score across folds and its standard deviation are computed.

Parameters

pool
  Alias: dtrain
  Possible types: Pool
  The input dataset to cross-validate.
  Default value: required parameter

params
  Possible types: dict
  The parameters to start training with.
  Default value: required parameter

iterations
  Aliases: num_boost_round, n_estimators, num_trees
  Possible types: int
  The maximum number of trees that can be built when solving machine learning problems.
  When other parameters limit the number of iterations, the final number of trees may be less than the value of this parameter.
  Default value: 1000

fold_count
  Alias: nfold
  Possible types: int
  The number of folds to split the dataset into.
  Default value: 3

inverted
  Possible types: bool
  Train on the test fold and evaluate the model on the training folds.
  Default value: False

partition_random_seed
  Alias: seed
  Possible types: int
  The seed value for the random permutation of the data. The permutation is performed before splitting the data for cross-validation. Each seed generates unique data splits.
  Default value: 0

shuffle
  Possible types: bool
  Shuffle the dataset objects before splitting into folds.
  Default value: True

logging_level
  Possible types: string
  The logging level to output to stdout.
  Possible values:
    • Silent: do not output any logging information to stdout.
    • Verbose: output the optimized metric, the elapsed time of training, and the remaining time of training to stdout.
    • Info: output additional information and the number of trees.
    • Debug: output debugging information.
  Default value: None (corresponds to the Verbose logging level)

stratified
  Possible types: bool
  Perform stratified sampling.
  Default value: False

as_pandas
  Possible types: bool
  Sets the type of return value to pandas.DataFrame. The return type is dict if this parameter is set to False or if the pandas Python package is not installed.
  Default value: True

metric_period
  Possible types: int
  The frequency of iterations at which the values of objectives and metrics are calculated. Must be a positive integer. Setting a larger period speeds up training.
  Default value: 1

verbose
  Alias: verbose_eval
  Possible types: bool, int
  The purpose of this parameter depends on the type of the given value:
    • bool: defines the logging level. True corresponds to the Verbose logging level, False to the Silent logging level.
    • int: use the Verbose logging level and set the logging period to the value of this parameter.
  Default value: False

plot
  Possible types: bool
  Plot the following information during training:
    • the metric values;
    • the custom loss values;
    • the time elapsed since training started;
    • the remaining time until the end of training.
  This option can be used if training is performed in a Jupyter notebook.
  Default value: False

early_stopping_rounds
  Possible types: int
  Sets the overfitting detector type to Iter and stops training after the specified number of iterations since the iteration with the optimal metric value.
  Default value: False

Type of return value

Depends on the value of the as_pandas parameter and the availability of the pandas Python package:
as_pandas value   pandas Python package availability   Type of return value
True              Installed                            pandas.DataFrame
True              Not installed                        dict
False             Unimportant                          dict

Each key (if the output type is dict) or column name (if the output type is pandas.DataFrame) is formed from the evaluation dataset type (train or test), metric name, and computed characteristic (std, mean, etc.). Each value is a list of corresponding computed values.

For example, if only the Logloss metric is specified in the parameters, then the return value is:

   test-Logloss-mean  test-Logloss-std  train-Logloss-mean  train-Logloss-std
0           0.398250          0.006558            0.394166           0.003950
1           0.351388          0.000644            0.348041           0.000795
2           0.340215          0.007079            0.336723           0.003994
3           0.332771          0.001593            0.329009           0.005679

Each key or column value contains the same number of calculated values as the number of training iterations (or less, if the overfitting detection is turned on and the threshold is reached earlier).

Examples

Perform cross-validation on the given dataset:
from catboost import Pool, cv

pool = Pool(x_train, y_train)  # x_train, y_train: training features and labels
params = {'iterations': 100, 
          'depth': 2, 
          'loss_function': 'MultiClass', 
          'classes_count': 3, 
          'verbose': False}
scores = cv(pool, params)
Perform cross-validation and save ROC curve points to the roc-file output file, given that the Python package is installed to the /home/ironman/ directory:
from catboost import Pool, cv

input_pool = Pool("/home/ironman/catboost/pytest/data/adult/train_small",
                  column_description="/home/ironman/catboost/pytest/data/adult/train.cd")
params = {'iterations': 100, 
          'depth': 2, 
          'loss_function': 'Logloss', 
          'verbose': False, 
          'roc_file': 'roc-file'}
scores = cv(input_pool, params)
Perform cross-validation on GPU with pairwise ranking, given that the Python package is installed to the /home/ironman/ directory:
import os.path
from catboost import Pool, cv

pool_path = '/home/ironman/catboost/pytest/data/querywise/'

pool = Pool(
    os.path.join(pool_path, 'train'),
    column_description=os.path.join(pool_path, 'train.cd'),
    pairs=os.path.join(pool_path, 'train.pairs')
)
scores = cv(
    pool,
    params={
        "iterations": 100,
        "loss_function": "PairLogit",
        "task_type": "GPU"
    }
)