Extending MovieLens-32M to Provide New Evaluation Objectives

Mark D. Smucker, Houmaan Chamani

TL;DR

This paper tackles the misalignment between offline recommender evaluation and user tasks by extending MovieLens-32M with a watchlist-oriented evaluation objective. Using pooling to assemble relevance judgments from 51 participants across 22 recommendation runs, the study demonstrates that evaluating by user interest in watching reduces the popularity bias that appears in traditional train/test setups. The results show strong alignment between interest-based judgments and high predicted ratings, and reveal that pooling-based evaluation drops the Popular baseline from rank 11 to rank 19 of the 22 runs, suggesting more faithful offline evaluation of top-n recommendations. The authors advocate for adopting interest-based evaluation with compatibility measures and propose this ML-32M extension as a pilot toward broader, IR-style test collections for recommender systems.

Abstract

Offline evaluation of recommender systems has traditionally treated the problem as a machine learning problem. In the classic case of recommending movies, where the user has provided explicit ratings of which movies they like and don't like, each user's ratings are split into test and train sets, and the evaluation task becomes predicting the held-out test data from the training data. This machine learning style of evaluation makes the objective to recommend the movies that a user has watched and rated highly, which is not the same task as helping the user find movies that they would enjoy if they watched them. This mismatch in objective between evaluation and task is a compromise to avoid the cost of asking a user to evaluate recommendations by watching each movie. We offer an extension to the MovieLens-32M dataset that provides for new evaluation objectives. Our primary objective is to predict the movies that a user would be interested in watching, i.e., to predict their watchlist. To construct this extension, we recruited MovieLens users, collected their profiles, made recommendations with a diverse set of algorithms, pooled the recommendations, and had the users assess the pools. This paper demonstrates the feasibility of using pooling to construct a test collection for recommender systems. Notably, we found that the traditional machine learning style of evaluation ranks the Popular algorithm, which recommends movies based on the total number of ratings in the system, in the middle of the twenty-two recommendation runs we used to build the pools. In contrast, when we rank the runs by users' interest in watching movies, we find that recommending popular movies becomes one of the worst-performing runs. It appears that by asking users to assess their personal recommendations, we can alleviate the issue of popularity bias in the evaluation of top-n recommendation.
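
The pooling step described in the abstract can be illustrated with a minimal sketch. It assumes the common IR-style approach of taking the union of the top-ranked items from every run for a given user; the run names, movie IDs, and pool depth below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[int]], depth: int) -> Set[int]:
    """Return the union of the top-`depth` items from each run's ranked list.

    `runs` maps a (hypothetical) run name to that run's ranked movie IDs
    for a single user. The pool is what the user would be asked to assess.
    """
    pool: Set[int] = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Illustrative rankings for one user from three made-up runs.
example_runs = {
    "popular": [1, 2, 3, 4],
    "item_knn": [3, 5, 1, 6],
    "mf": [7, 3, 8, 2],
}
example_pool = build_pool(example_runs, depth=2)
print(sorted(example_pool))
```

Because the pool is a set, a movie recommended by several runs is assessed only once, which is what keeps the assessment cost per user manageable.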

Paper Structure

This paper contains 20 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Traditional train/test evaluation with 51 participants vs. 10K ML-32M users for ratings $\ge 4.0$. This figure shows the effect of using the 51 study participants rather than a random sample of ML-32M users. Both axes represent the same train/test evaluation approach, but each axis has a different set of user profiles. For the top half of the runs, the ranking of the algorithms changes only slightly between our 51 participants and the random 10K ML-32M users. Most of the rank changes occur among the lower-performing algorithms. With a traditional train/test evaluation approach, the Popular algorithm is ranked in the middle of the 22 runs.
  • Figure 2: Traditional train/test evaluation vs. pooling-based evaluation. Each axis shows an evaluation with the same 51 participants. The vertical axis shows nDCG@100 scores for the high ($\ge 4.0$) test ratings in a train/test split evaluation. The horizontal axis shows the pooling-based evaluation with nDCG@100 scores for high-rating.qrels ($\ge 4.0$). Of note, the Popular algorithm drops from rank 11 with train/test evaluation to rank 19 with pooling-based evaluation.
  • Figure 3: Pooling-based evaluation with our 51 participants on both axes using preferences and the compatibility (p=0.98) measure. The vertical axis shows the performance of the runs when measured with the interest-prefer-less-familiar.qrels, and the horizontal axis shows the regular interest.qrels. Of note, the Popular run becomes the worst of all runs when we score by interest in watching and, within preference levels, prefer less familiar movies.
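
For readers unfamiliar with the nDCG@100 scores mentioned in the captions for Figure 2, the following is a minimal sketch of the standard nDCG@k computation. It assumes a linear gain, a binary qrels mapping in which a movie rated $\ge 4.0$ has gain 1.0, and that unjudged movies count as non-relevant; the paper's exact gain mapping is not stated in this section, and the function names and data are hypothetical.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with the usual log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranking: List[int], qrels: Dict[int, float], k: int = 100) -> float:
    """nDCG@k for one user's ranked list against that user's judgments.

    Items missing from `qrels` are treated as non-relevant (an assumption).
    """
    gains = [qrels.get(item, 0.0) for item in ranking[:k]]
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Illustrative judgments: movies 10 and 20 are relevant, movie 30 is not.
example_qrels = {10: 1.0, 20: 1.0, 30: 0.0}
example_score = ndcg_at_k([10, 30, 20], example_qrels, k=100)
print(round(example_score, 4))
```

A run's score in plots like those described above would then be the mean of this per-user value over all participants; ranking the runs by that mean under two different sets of judgments gives the two axes being compared.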