Calculate object importance

Purpose

Calculate the effect of objects from the training dataset on the optimized metric values for the objects from the validation dataset:
  • Positive values reflect that the optimized metric increases.
  • Negative values reflect that the optimized metric decreases.
The higher the deviation from 0, the bigger the impact that an object has on the optimized metric.
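
As a minimal sketch of how to read these values, the following Python snippet ranks a few training objects by the absolute value of their scores. The object indices and scores are made-up illustrative numbers, not output from a real model.

```python
# Hypothetical (training object index, importance score) pairs.
scores = [(0, 0.012), (1, -0.250), (2, 0.003), (3, 0.140)]

# Rank by deviation from 0: the larger the absolute value, the bigger the impact.
ranked = sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)

for index, score in ranked:
    # Sign convention from the list above: positive means the optimized
    # metric increases, negative means it decreases.
    direction = "increases" if score > 0 else "decreases"
    print(f"object {index}: score {score:+.3f} (optimized metric {direction})")
```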

This mode implements the approach described in the paper "Finding Influential Training Samples for Gradient Boosted Decision Trees".

Execution format

catboost ostr [optional parameters]
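
As an illustration, the sketch below assembles one possible invocation from the options listed in the Options section. All file names except the documented defaults are hypothetical placeholders. The snippet only prints the command and does not run it.

```python
import shlex

# Hypothetical input paths; replace them with your own files.
command = [
    "catboost", "ostr",
    "--model-path", "model.bin",                # trained model (documented default name)
    "--learn-set", "train.tsv",                 # training dataset (required)
    "--test-set", "validation.tsv",             # validation dataset (required)
    "--column-description", "train.cd",         # column descriptions
    "--output-path", "object_importances.tsv",  # documented default name
    "--update-method", "TopKLeaves:top=3",
]

# Print a shell-safe version of the command for inspection.
print(shlex.join(command))
```

Passing the same list to `subprocess.run(command, check=True)` would execute it, assuming the `catboost` binary is available on `PATH`.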

Options

Option | Description | Default value

-m

--model-path

The name of the input file with the description of the model obtained as the result of training.

Default value: model.bin

--model-format

The format of the input model.

Possible values:
  • CatboostBinary.
  • AppleCoreML (only datasets without categorical features are supported).
  • json (multiclassification models are not currently supported). Refer to the CatBoost JSON model tutorial for format details.

Default value: CatboostBinary

-f

--learn-set

The path to the input file that contains the training dataset description.

Default value: Required parameter (the path must be specified).

-t

--test-set

The path to the input file that contains the validation dataset description (the format must be the same as the one used for the training dataset).

Default value: Required parameter

--cd

--column-description

The path to the input file that contains the column descriptions.

If omitted, it is assumed that the first column in the file with the dataset description defines the label value, and the other columns are the values of numerical features.
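
The default layout described above can be sketched in Python as follows. This only mimics the assumed column order for a single tab-separated row. It is not CatBoost's actual parser, and the function name and sample row are hypothetical.

```python
def split_default_columns(line, delimiter="\t"):
    """Split one dataset row using the layout assumed when --cd is omitted."""
    fields = line.rstrip("\n").split(delimiter)
    label = fields[0]                               # first column: label value
    features = [float(value) for value in fields[1:]]  # other columns: numerical features
    return label, features


# Hypothetical sample row: a label followed by three numerical features.
label, features = split_default_columns("1\t0.5\t3.2\t7\n")
print(label, features)
```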

-o

--output-path

The name of the output file that contains the calculated object importance values.

The format depends on the problem being solved.

Default value: object_importances.tsv

-T

--thread-count

The number of threads to use for the calculation.

Optimizes the speed of execution. This parameter doesn't affect results.

Default value: The number of processor cores

--delimiter

The delimiter character used to separate the data in the dataset description input file.

Only single-character delimiters are supported. If the specified value contains more than one character, only the first one is used.

Default value: The input data is assumed to be tab-separated

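
The delimiter rules for this option can be sketched as a small Python helper. The function name is hypothetical, and treating an empty value like a missing one is an assumption made for this sketch.

```python
def effective_delimiter(value=None):
    """Return the delimiter that results from a given --delimiter value."""
    if not value:
        # No value given (or empty, by assumption): data is tab-separated.
        return "\t"
    # Longer values: only the first character is used.
    return value[0]


print(repr(effective_delimiter()))
print(repr(effective_delimiter(",;")))
```
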
--has-header

Read the column names from the first line if this parameter is set to True.

Default value: False

--update-method

The method used to calculate object importance. It determines the trade-off between accuracy and speed.

Possible values:
  • SinglePoint — The fastest and least accurate method.
  • TopKLeaves — Specify the number of leaves. The higher the value, the more accurate and the slower the calculation.
  • AllPoints — The slowest and most accurate method.
Supported parameters:
  • top: the number of leaves to use with the TopKLeaves method.
For example, the following value sets the method to TopKLeaves and limits the number of leaves to 3:
TopKLeaves:top=3

Default value: SinglePoint
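
To make the value format concrete, the following Python sketch splits an --update-method value into the method name and its parameter. The helper name is hypothetical, and the sketch handles only the single name=value form shown in the example above.

```python
def parse_update_method(value):
    """Split an --update-method value into (method name, parameters)."""
    name, _, raw_param = value.partition(":")
    params = {}
    if raw_param:
        # Only the single "name=value" form from the example is handled here.
        key, _, number = raw_param.partition("=")
        params[key] = int(number)
    return name, params


print(parse_update_method("TopKLeaves:top=3"))
print(parse_update_method("SinglePoint"))
```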