What Are We Optimizing For? A Human-centric Evaluation of Deep Learning-based Movie Recommenders

Ruixuan Sun; Xinyi Wu; Avinash Akella; Ruoyan Kong; Bart Knijnenburg; Joseph A. Konstan

TL;DR

This work interrogates whether deep-learning movie recommenders optimize for user-grounded success, not just offline accuracy. By collecting real-user judgments on DL-RecSys outputs across seven human-centric metrics and applying path analysis, it demonstrates that DL models improve novelty and serendipity but struggle with diversity, transparency, trust, and overall satisfaction relative to traditional collaborative filtering. The study links user context and model attributes to downstream perceptions, revealing that excessive serendipity and limited diversity erode transparency and trust, which in turn reduces satisfaction. It also shows the value of user input on desirable attributes and outlines design directions, such as diversity-aware training and explainability, to enhance human-centric performance in DL-RecSys.

Abstract

In the past decade, deep learning (DL) models have gained prominence for their exceptional accuracy on benchmark datasets in recommender systems (RecSys). However, their evaluation has relied primarily on offline metrics, overlooking direct user perception and experience. To address this gap, we conduct a human-centric evaluation case study of four leading DL-RecSys models in the movie domain. We test how different DL-RecSys models perform in personalized recommendation generation by conducting a survey study with 445 real active users. We find that some DL-RecSys models are superior at recommending novel and unexpected items but weaker in diversity, trustworthiness, transparency, accuracy, and overall user satisfaction compared to classic collaborative filtering (CF) methods. To explain the reasons behind this underperformance, we apply a comprehensive path analysis. We discover that the lack of diversity and the excess of serendipity in DL model recommendations can negatively affect the perceived transparency and personalization of those recommendations. Such a path ultimately leads to lower summative user satisfaction. Qualitatively, we confirm with real user quotes that accuracy plus at least one other attribute is necessary to ensure a good user experience, and that users' demands for transparency and trust cannot be neglected. Based on our findings, we discuss future human-centric DL-RecSys design and optimization strategies.

Paper Structure (19 sections, 3 figures, 9 tables)

Introduction
Related Work
Evaluation of DL-RecSys
User Perspective in Recommender Systems
Research Methods
Deep Learning and Baseline Models
Users and Training Data
Survey Design
Path Analysis
Results
Individual Model Performance
User Perception Path
Qualitative Analysis
Discussion
Current State of DL-RecSys models
...and 4 more sections

Figures (3)

Figure 1: User evaluation flow and survey questions.
Figure 2: Marginal effects (both direct and indirect) of each model on the downstream variables in the path. Error bars indicate standard errors.
Figure 3: Path analysis on the data collected in this study. The model shows how user contextual factors and different model perception metrics can influence each other and overall user satisfaction. Each arrow indicates a direct effect from one variable to another, with the associated $\beta_{\mathrm{coef}}$ and standard error shown on the line. Asterisks indicate the significance level of each effect: * for $p < .05$, ** for $p < .01$, and *** for $p < .001$.
