A Worrying Reproducibility Study of Intent-Aware Recommendation Models

Faisal Shehzad, Maurizio Ferrari Dacrema, Dietmar Jannach

TL;DR

The paper examines reproducibility in intent-aware recommender systems (IARS) by attempting to reproduce five contemporary neural IARS models using the authors’ artifacts and comparing them to well-tuned traditional baselines. In two of the five cases, running the provided code with the reported optimal hyperparameters did not yield the published results, and every examined IARS model was consistently outperformed by at least one traditional model. The study highlights pervasive issues such as insufficient artifact sharing, inconsistent baselines, and gaps in data splitting and hyperparameter tuning, and it calls for stronger reproducibility practices, containerized artifacts, and transparent evaluation protocols. The findings suggest that reported progress in IARS may be overstated without rigorous replication, with implications for how future IARS research is conducted, evaluated, and shared.

Abstract

Lately, we have observed a growing interest in intent-aware recommender systems (IARS). The promise of such systems is that they are capable of generating better recommendations by predicting and considering the underlying motivations and short-term goals of consumers. From a technical perspective, various sophisticated neural models were recently proposed in this emerging and promising area. In the broader context of complex neural recommendation models, a growing number of research works unfortunately indicates that (i) reproducing such works is often difficult and (ii) that the true benefits of such models may be limited in reality, e.g., because the reported improvements were obtained through comparisons with untuned or weak baselines. In this work, we investigate if recent research in IARS is similarly affected by such problems. Specifically, we tried to reproduce five contemporary IARS models that were published in top-level outlets, and we benchmarked them against a number of traditional non-neural recommendation models. In two of the cases, running the provided code with the optimal hyperparameters reported in the paper did not yield the results reported in the paper. Worryingly, we find that all examined IARS approaches are consistently outperformed by at least one traditional model. These findings point to sustained methodological issues and to a pressing need for more rigorous scholarly practices.

Paper Structure (37 sections, 7 tables)