Revisiting Language Models in Neural News Recommender Systems

Yuyue Zhao, Jin Huang, David Vos, Maarten de Rijke

TL;DR

The paper investigates whether increasing language model size consistently improves neural news recommender systems, using a unified evaluation across SLM, PLM, and LLM encoders with three representative RS models on the MIND-small dataset. It analyzes non-fine-tuned and fine-tuned configurations, plus a two-step fine-tuning approach for LLMs, and evaluates performance with AUC, MRR, nDCG@5, and nDCG@10. Key findings show that larger LMs do not reliably boost accuracy and require careful hyperparameter tuning and greater compute, though they offer notable improvements for cold-start users. The work provides practical guidance on when to deploy larger LMs in news RSs, balancing performance gains against resource constraints, and suggests directions for improving stability and effectiveness across diverse user groups.
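To make the four ranking metrics concrete, here is a minimal sketch of how they are commonly computed for a single impression (one user's candidate list with binary click labels). This follows the usual per-impression definitions used with MIND-style data and is an assumption, not the paper's own evaluation code; the function names and the toy labels and scores are hypothetical.

```python
import math


def rank_labels(labels, scores):
    """Return the click labels reordered by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def auc(labels, scores):
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mrr(labels, scores):
    """Mean reciprocal rank, averaged over the clicked items of the impression."""
    ranked = rank_labels(labels, scores)
    return sum(l / (i + 1) for i, l in enumerate(ranked)) / sum(ranked)


def dcg_at_k(ranked_labels, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum((2 ** l - 1) / math.log2(i + 2) for i, l in enumerate(ranked_labels[:k]))


def ndcg_at_k(labels, scores, k):
    """DCG@k of the model ranking divided by DCG@k of the ideal ranking."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(rank_labels(labels, scores), k) / ideal


# Toy impression (hypothetical): 5 candidate news items, 2 of them clicked.
labels = [0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.1, 0.3, 0.2]
results = {
    "AUC": auc(labels, scores),
    "MRR": mrr(labels, scores),
    "nDCG@5": ndcg_at_k(labels, scores, 5),
    "nDCG@10": ndcg_at_k(labels, scores, 10),
}
```

In practice each metric is averaged over all impressions in the test set; impressions with no clicks or no non-clicks need to be skipped, since the denominators above would be zero.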

Abstract

Neural news recommender systems (RSs) have integrated language models (LMs) to encode news articles with rich textual information into representations, thereby improving the recommendation process. Most studies suggest that (i) news RSs achieve better performance with larger pre-trained language models (PLMs) than shallow language models (SLMs), and (ii) that large language models (LLMs) outperform PLMs. However, other studies indicate that PLMs sometimes lead to worse performance than SLMs. Thus, it remains unclear whether using larger LMs consistently improves the performance of news RSs. In this paper, we revisit, unify, and extend these comparisons of the effectiveness of LMs in news RSs using the real-world MIND dataset. We find that (i) larger LMs do not necessarily translate to better performance in news RSs, and (ii) they require stricter fine-tuning hyperparameter selection and greater computational resources to achieve optimal recommendation performance than smaller LMs. On the positive side, our experiments show that larger LMs lead to better recommendation performance for cold-start users: they alleviate dependency on extensive user interaction history and make recommendations more reliant on the news content.

Paper Structure (15 sections, 2 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: The typical structure of neural news recommendation methods.
  • Figure 2: SLMs and PLMs as building blocks of news encoders. Each LM can be used either in its non-fine-tuned form, shown in the left plots, or in its fine-tuned form, shown in the right plots. The parameters/embeddings in the blue "ice" section are fixed, while those in the red "flame" section are fine-tuned.
  • Figure 3: Fine-tuning LLMs as news encoders. In step 1, the LLMs are fine-tuned on news data presented in a natural language format. In step 2, the fine-tuned LLMs generate news embeddings, which are used for the recommendation task.
  • Figure 4: Effect of fine-tuning versus no fine-tuning in the BERT family.
  • Figure 5: Effect of varying the number of fine-tuned layers in BERT.
  • ...and 2 more figures