Data description

Train set

Each row in the training set corresponds to a product listing in MercadoLibre and includes the following data:

Column Description
Title The listing title.
Language The title’s language, which can be either Spanish or Portuguese.
Label quality Either reliable or unreliable. We reviewed the labels of the reliable rows to minimize errors (although they might not be 100% accurate either). The rows marked as unreliable were not reviewed by us: the category was picked by the seller, so a higher rate of mislabelling may be expected.
Category The target label. The set of possible categories is the same in both languages. That means that a cellphone’s title will map to the CELLPHONES category regardless of whether it is in Spanish or Portuguese.

Reliable cases are not evenly distributed across categories in the training set. Some categories may contain many reliable examples, while others contain no reliable examples at all.
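To see how uneven the reliable labels are, it can help to count reliable rows per category. The following is a minimal sketch, not an official loader: it assumes the training data is a CSV whose header names are title, language, label_quality and category (the actual header names may differ), and it uses a tiny made-up sample in place of the real file. DINING_TABLES is a hypothetical category used only for illustration.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample standing in for the training file. The header names
# (title, language, label_quality, category) are assumptions, and
# DINING_TABLES is a hypothetical category.
SAMPLE_TRAIN = """title,language,label_quality,category
Celular Samsung Galaxy,spanish,reliable,CELLPHONES
Celular Motorola usado,portuguese,unreliable,CELLPHONES
Mesa de jantar,portuguese,unreliable,DINING_TABLES
"""

def reliable_counts(csv_file):
    """Return {category: number of reliable rows}, keeping categories with zero."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Adding 0 for unreliable rows still registers the category,
        # so categories without any reliable row show up with a count of 0.
        counts[row["category"]] += int(row["label_quality"] == "reliable")
    return dict(counts)

counts = reliable_counts(io.StringIO(SAMPLE_TRAIN))
no_reliable = sorted(c for c, n in counts.items() if n == 0)
print(counts)
print(no_reliable)
```

With the real file, the in-memory sample would be replaced by an opened file handle. Categories that end up in no_reliable can only be learned from seller-picked labels.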

Test set

The test set contains only three columns: id, title and language. The id identifies each sample in your submission file, while the title and language can be used as input features for your model.
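As a rough illustration of how the id ties predictions back to test rows, the sketch below reads test rows and writes one prediction per id. It rests on assumptions: the test data is taken to be a CSV with the header id,title,language, the output columns id and category are a guess at the submission layout (the provided sample submission file is the authority on the real format), and predict_category is a hypothetical placeholder for an actual model.

```python
import csv
import io

# Made-up sample standing in for the test file; the CSV layout is an assumption.
SAMPLE_TEST = """id,title,language
0,Celular Samsung Galaxy,spanish
1,Celular Motorola usado,portuguese
"""

def predict_category(title, language):
    """Hypothetical placeholder model: always predicts CELLPHONES."""
    return "CELLPHONES"

def build_submission(test_file, out_file):
    """Write one row per test sample, keyed by its id."""
    writer = csv.writer(out_file, lineterminator="\n")
    # Assumed output columns; check them against the sample submission file.
    writer.writerow(["id", "category"])
    for row in csv.DictReader(test_file):
        prediction = predict_category(row["title"], row["language"])
        writer.writerow([row["id"], prediction])

out = io.StringIO()
build_submission(io.StringIO(SAMPLE_TEST), out)
submission_text = out.getvalue()
print(submission_text)
```

Carrying the id through unchanged, rather than relying on row order, keeps the submission valid even if the rows are shuffled or batched during prediction.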

Sample submission

We provide a sample submission file to visualize the expected format of your submission.

Download links