Evaluation

Metric

Submissions will be scored using the average NDCG.

NDCG stands for Normalized Discounted Cumulative Gain, a popular method for measuring the quality of a set of results. The metric can easily be adapted to the recommendation domain.

NDCG is based on the following premises:

  • cumulative gain: Very relevant recommendations are more useful than somewhat relevant recommendations, which in turn are more useful than irrelevant recommendations. (i.e. we want to recommend relevant items to the user!)
  • discounting: Relevant recommendations are more useful when they appear earlier in the set of recommendations (i.e. the order matters!).
  • normalization: We want the metric to lie in the interval [0,1].
    The formula we will use to compute NDCG is as follows:
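
    Here DCG is the Discounted Cumulative Gain of your recommendation set and iDCG is the Discounted Cumulative Gain of the ideal recommendation set:

        NDCG = \frac{DCG}{iDCG}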

    In other words, NDCG is just the Discounted Cumulative Gain divided by the Discounted Cumulative Gain of the ideal recommendation set.

    Now, how do we calculate Discounted Cumulative Gain?

    Well, we could just add up the relevance of each item in the recommendation set (this gives us the Cumulative Gain). The problem with this metric is that it doesn't take into account how the recommended items are sorted: recommending the right item in the 10th position would be worth just as much as recommending it in the 1st position.

    We don't want this to happen, so we discount the Cumulative Gain by weighting each relevance inversely to its position, using logarithms to emphasize this effect.

    The resulting formula looks like this:
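
    A standard way to write this discount, and the convention we assume here, is to divide the relevance at position i by \log_2(i + 1), summing over the ten positions of the recommendation set:

        DCG = \sum_{i=1}^{10} \frac{rel_i}{\log_2(i + 1)}

    where rel_i is the relevance of the item recommended in the i-th position.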

    Finally, we need to define relevance in some way.

    For this problem, we have decided that the relevance of predicting a certain item ŷ for a corresponding purchase y is given by the following function:
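
    Writing \hat{y} for a recommended item and y for the item the user actually bought:

        rel(\hat{y}, y) = \begin{cases}
            12 & \text{if } \hat{y} \text{ has the same item\_id as } y \\
             1 & \text{if } \hat{y} \text{ only shares the domain\_id of } y \\
             0 & \text{otherwise}
        \end{cases}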


    This means that...

  • If you hit the item_id of the user's target purchase, you'll get a relevance of 12 for that recommendation in your recommendation set.
  • If your prediction doesn't match the correct item_id but does belong to the same domain, you'll get 1 relevance point instead.
  • If you match neither, your relevance will be 0.
  • The ideal recommendation set is the one with the target item_id in the 1st position, followed by 9 item_ids belonging to the same domain_id as the target item in the remaining positions (see the sketch below).
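
    Putting these definitions together, here is a minimal sketch in Python of how a single row could be scored. The names (relevance, dcg, row_ndcg, item_to_domain) are our own, and the official scorer may differ in minor details such as the exact log convention:

        import math

        def relevance(pred_item, target_item, item_to_domain):
            # Relevance of one recommended item against the target purchase.
            if pred_item == target_item:
                return 12                                  # same item_id: full credit
            pred_domain = item_to_domain.get(pred_item)
            if pred_domain is not None and pred_domain == item_to_domain.get(target_item):
                return 1                                   # same domain_id only
            return 0                                       # no match at all

        def dcg(relevances):
            # Discounted Cumulative Gain with a log2(position + 1) discount.
            return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

        def row_ndcg(recommendations, target_item, item_to_domain):
            # NDCG of one row: DCG of the submitted set over DCG of the ideal set.
            rels = [relevance(p, target_item, item_to_domain) for p in recommendations[:10]]
            ideal = [12] + [1] * 9    # target item first, then 9 items from its domain
            return dcg(rels) / dcg(ideal)

    Your submission's score is then simply the average of row_ndcg over every row of the test set.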

    Submissions

    Results should be submitted in a .csv file with ten item_ids per row, separated by commas, with no header. Each row is the recommendation set for the corresponding row number in the test set. (Yes, we'll assume that your predictions are sorted in the same order as the original test dataset.)

    If any row has fewer than ten items, your recommendation will still be a valid answer, but you will receive zero relevance for the missing items in the recommendation set.

    Keep in mind that the way you sort the item_ids within each row matters: the first item_id of the row should be the most likely future purchase, the second item_id should be the second most likely future purchase, and so on.

    Finally, it is important to notice that, within a given recommendation set (i.e. row), item_ids should be unique (otherwise, you could score greater than 1). If you submit a file with duplicated item_ids in a row, we'll keep only the unique values and maintain the original ordering.

    Together with the challenge data, we are providing a sample submission file that illustrates the expected format described above, so that you don't miss a single detail!
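
    As an illustration only (the helper write_submission and the file path are made up), here is a sketch of writing predictions in this format, dropping duplicated item_ids within a row while preserving their order and keeping at most ten per row:

        import csv

        def write_submission(rows, path="submission.csv"):
            # rows: one list of item_ids per row of the test set, already sorted from
            # most to least likely purchase.
            with open(path, "w", newline="") as f:
                writer = csv.writer(f)
                for items in rows:
                    unique = list(dict.fromkeys(items))   # drop duplicates, keep first occurrence
                    writer.writerow(unique[:10])          # at most ten item_ids, no header row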

    Leaderboard

    Each time you submit a prediction, it will be evaluated on a subset of 30% of the test dataset. That’s the score that will be shown on the public leaderboard.

    Final Evaluation

    Once the competition ends, we will compute a final score for each participant by taking their last submission and scoring it against the remaining 70% of observations in the test dataset.

    Rules

    No MercadoLibre employees.
    MercadoLibre employees cannot participate in the challenge.

    One account per participant.
    Registering multiple accounts and submitting from them is not allowed.

    One account, one person.
    Participation is individual; teams are not allowed.

    Three daily submissions.
    You may make at most three submissions per day.

    No manual labeling.
    Your submission must be automatically generated by a computer program; no manual labeling may be encoded in the software.

    No external data.
    You can only use the data we provide. Using MercadoLibre’s APIs or any other data sources to increase the feature set is not allowed for this competition.

    Pre-trained models are allowed.
    You can use any pre-trained models, as long as they are publicly available before the competition starts.

    Countries eligible for prizes.
    Only participants from Argentina, Brazil, Colombia, Chile, Mexico and Uruguay are eligible to win prizes.