Are Large Language Models Really Effective for Training-Free Cold-Start Recommendation?

Genki Kusano, Kenya Abe, Kunihiro Takeoka

TL;DR

This study rigorously compares training-free cold-start recommendation (TFCSR) methods based on large language models (LLMs) and text embedding models (TEMs) under identical conditions. Through controlled experiments on three public datasets, TEMs consistently outperform LLM rerankers in both narrow and broad cold-start settings, with TEMs trained via LLM supervision (e.g., Qwen embeddings) achieving the strongest results. The findings challenge the assumption that LLMs are always optimal for training-free scenarios and highlight TEM-based approaches as more scalable and reliable for TFCSR. The work also provides insights into error patterns, the impact of user history size, and cross-domain transfer, outlining directions for integrating structured data and synthetic training signals.

Abstract

Recommender systems usually rely on large-scale interaction data to learn from users' past behaviors and make accurate predictions. However, real-world applications often face situations where no training data is available, such as when launching new services or handling entirely new users. In such cases, conventional approaches cannot be applied. This study focuses on training-free recommendation, where no task-specific training is performed, and particularly on training-free cold-start recommendation (TFCSR), the more challenging case where the target user has no interactions. Large language models (LLMs) have recently been explored as a promising solution, and numerous LLM-based methods have been proposed. As text embedding models (TEMs) become more capable, they are increasingly recognized as applicable to training-free recommendation, but no prior work has directly compared LLMs and TEMs under identical conditions. We present the first controlled experiments that systematically evaluate these two approaches in the same setting. The results show that TEMs outperform LLM rerankers, and this trend holds not only in cold-start settings but also in warm-start settings with rich interactions. These findings indicate that, contrary to the commonly shared belief, direct LLM ranking is not the only viable option, and that TEM-based approaches provide a stronger and more scalable basis for training-free recommendation.

Paper Structure

This paper contains 19 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Pie charts of user-level win/loss relationships. Each chart shows, for every user, which model obtained the higher score. Users with identical scores are grouped into the "same" region.
  • Figure 2: Recall@10 and nDCG@10 for various interaction sizes $m$. Statistical significance is evaluated using the one-sided Wilcoxon signed-rank test against gpt-4.1; $*$ and $\bigtriangledown$ denote results significantly higher or lower at $p=10^{-4}$. The horizontal dotted line represents the score of a random ranking.
  • Figure 3: Recall@5 and nDCG@5 when the number of candidate items is 10. The symbols $*$, $\bigtriangledown$, and the horizontal dotted line follow the same definitions as in Figure 2.
  • Figure 4: Relative improvement (%) over Raw-Raw for different TEMs. The light gray area marks the range from $-5\%$ to $5\%$.
  • Figure 5: Recall@10 and nDCG@10 for the cross-domain setting with various $m'$.
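
The figure captions above report Recall@k and nDCG@k. As a rough illustration of what these metrics measure, the following is a minimal sketch assuming binary relevance and the standard log2 discount; the paper's exact definitions are not given in this section, and the function names `recall_at_k` and `ndcg_at_k` are hypothetical.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking.

    Assumption: normalization by the number of relevant items. Some papers
    normalize by min(k, number of relevant items) instead.
    """
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain at k with binary relevance."""
    # Gain of 1 for each relevant item, discounted by log2(position + 1),
    # where positions start at 1 (so index 0 has discount log2(2) = 1).
    dcg = sum(
        1.0 / math.log2(index + 2)
        for index, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal ranking places all relevant items first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(index + 2) for index in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranking = ["item_a", "item_b", "item_c", "item_d"]  # hypothetical candidate order
    ground_truth = {"item_b"}                           # hypothetical held-out item
    print(recall_at_k(ranking, ground_truth, 2))
    print(ndcg_at_k(ranking, ground_truth, 2))
```

With a single held-out item per user, Recall@k is 1 if that item is ranked in the top k and 0 otherwise, while nDCG@k additionally rewards placing it nearer the top.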