amazon

Load the dataset from the Kaggle Amazon Employee Access Challenge.

This dataset is best suited for binary classification.

The training dataset contains 32769 objects. Each object is described by 10 integer-valued columns: the ACTION column, which is used as the label, and 9 feature columns.

The test dataset contains 58921 objects. The structure is identical to the training dataset with the following variations:
  • The ACTION column is omitted.
  • The id column is added.
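The difference between the two column sets can be sketched in plain Python. This is a minimal illustration, not part of the catboost API: the train column names are copied from the example output on this page, the test column list is assembled from the two variations listed above (the position of the id column is an assumption), and column_diff is a hypothetical helper. With the real data, amazon_train.columns and amazon_test.columns could be passed instead.

```python
# Train column names as printed in this page's example output.
TRAIN_COLUMNS = [
    "ACTION", "RESOURCE", "MGR_ID", "ROLE_ROLLUP_1", "ROLE_ROLLUP_2",
    "ROLE_DEPTNAME", "ROLE_TITLE", "ROLE_FAMILY_DESC", "ROLE_FAMILY",
    "ROLE_CODE",
]

# Test columns as described above: ACTION omitted, id added.
# Placing "id" first is an assumption made only for this sketch.
TEST_COLUMNS = ["id"] + [c for c in TRAIN_COLUMNS if c != "ACTION"]


def column_diff(train_columns, test_columns):
    """Hypothetical helper: return (only_in_train, only_in_test) as sorted lists."""
    train_set, test_set = set(train_columns), set(test_columns)
    return sorted(train_set - test_set), sorted(test_set - train_set)


only_in_train, only_in_test = column_diff(TRAIN_COLUMNS, TEST_COLUMNS)
print(only_in_train)  # ['ACTION']
print(only_in_test)   # ['id']
```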

Method call format

amazon()

Type of return value

A tuple of two pandas.DataFrame objects (the train and test datasets, in that order).

The train dataset contains the “ACTION” label.

Usage examples

from catboost.datasets import amazon
amazon_train, amazon_test = amazon()

print(amazon_train.head(3))

The output of this example:

   ACTION  RESOURCE  MGR_ID  ROLE_ROLLUP_1  ROLE_ROLLUP_2  ROLE_DEPTNAME  ROLE_TITLE  ROLE_FAMILY_DESC  ROLE_FAMILY  ROLE_CODE
0       1     39353   85475         117961         118300         123472      117905            117906       290919     117908
1       1     17183    1540         117961         118343         123125      118536            118536       308574     118539
2       1     36724   14457         118219         118220         117884      117879            267952        19721     11788
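Because the label is stored in the same DataFrame as the features, a common next step is to separate the two. The sketch below shows one way to do that with pandas. It is an illustration only: split_label is a hypothetical helper, and the small sample frame is built from the first two rows of the output above so that the code runs without downloading the dataset.

```python
import pandas as pd


def split_label(frame, label="ACTION"):
    """Hypothetical helper: return (features, target) without modifying frame."""
    features = frame.drop(columns=[label])
    target = frame[label]
    return features, target


# Stand-in for amazon_train: the first two rows of the output shown above.
sample = pd.DataFrame(
    [
        [1, 39353, 85475, 117961, 118300, 123472, 117905, 117906, 290919, 117908],
        [1, 17183, 1540, 117961, 118343, 123125, 118536, 118536, 308574, 118539],
    ],
    columns=[
        "ACTION", "RESOURCE", "MGR_ID", "ROLE_ROLLUP_1", "ROLE_ROLLUP_2",
        "ROLE_DEPTNAME", "ROLE_TITLE", "ROLE_FAMILY_DESC", "ROLE_FAMILY",
        "ROLE_CODE",
    ],
)

X, y = split_label(sample)
print(X.shape)     # (2, 9)
print(y.tolist())  # [1, 1]
```

With the real data, the same helper would be called as split_label(amazon_train).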