CALRec: Contrastive Alignment of Generative LLMs for Sequential Recommendation
Yaoyiran Li, Xiang Zhai, Moustafa Alzantot, Keyi Yu, Ivan Vulić, Anna Korhonen, Mohamed Hammad
TL;DR
CALRec reframes sequential recommendation as a text-to-text task using pretrained LLMs and introduces a two-stage fine-tuning framework whose mixed objective combines next-item generation with contrastive alignment. The approach relies on carefully designed prompts, multi-domain fine-tuning followed by target-domain fine-tuning, and quasi-round-robin BM25 retrieval to produce accurate next-item predictions, with reported gains over state-of-the-art baselines of +37% in Recall@1 and +24% in NDCG@10. Key contributions include the two-stage training paradigm, the objective mix of the next-item generation loss L_NIG with the two contrastive losses L_TT and L_UT, and the retrieval mechanism that integrates generation with lexical ranking. The work demonstrates the practical viability of large language models for sequential recommendation, while also discussing limitations (e.g., item cold-start) and potential reranking strategies for large catalogs.
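The summary above names quasi-round-robin BM25 retrieval but does not spell out the procedure. The sketch below shows one plausible reading, under stated assumptions: the LLM generates several candidate item texts, each candidate has already been used as a BM25 query to produce a ranked list of catalog item IDs, and the lists are interleaved one item per turn while skipping duplicates. The function name `quasi_round_robin` and the input format are hypothetical and are not taken from the paper.

```python
from typing import List


def quasi_round_robin(ranked_lists: List[List[str]], k: int) -> List[str]:
    """Interleave several ranked lists into one top-k list without duplicates.

    Assumption: ranked_lists[i] holds item IDs ranked by BM25 for the i-th
    generated candidate, ordered from the most to the least preferred candidate.
    """
    merged: List[str] = []
    seen = set()
    cursors = [0] * len(ranked_lists)  # next unread position in each list
    progressed = True
    while len(merged) < k and progressed:
        progressed = False
        for i, ranking in enumerate(ranked_lists):
            # Skip items that an earlier list has already contributed.
            while cursors[i] < len(ranking) and ranking[cursors[i]] in seen:
                cursors[i] += 1
            if cursors[i] < len(ranking):
                item = ranking[cursors[i]]
                seen.add(item)
                merged.append(item)
                cursors[i] += 1
                progressed = True
                if len(merged) == k:
                    break
    return merged


if __name__ == "__main__":
    lists = [["a", "b", "c"], ["a", "d", "e"], ["f", "a", "b"]]
    print(quasi_round_robin(lists, k=5))
```

Skipping duplicates instead of forfeiting a turn is what makes this variant "quasi" round-robin in this sketch: every list contributes its best not-yet-selected item on each pass, so the final ranking still mixes evidence from all generated candidates.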
Abstract
Traditional recommender systems such as matrix factorization methods have primarily focused on learning a shared dense embedding space to represent both items and user preferences. Subsequently, sequence models such as RNNs, GRUs, and, more recently, Transformers have emerged and excelled in the task of sequential recommendation. This task requires understanding the sequential structure present in users' historical interactions to predict the next item they may like. Building upon the success of Large Language Models (LLMs) in a variety of tasks, researchers have recently explored using LLMs that are pretrained on vast corpora of text for sequential recommendation. To use LLMs for sequential recommendation, both the history of user interactions and the model's prediction of the next item are expressed in text form. We propose CALRec, a two-stage LLM finetuning framework that finetunes a pretrained LLM in a two-tower fashion using a mixture of two contrastive losses and a language modeling loss: the LLM is first finetuned on a data mixture from multiple domains, followed by another round of target domain finetuning. Our model significantly outperforms many state-of-the-art baselines (+37% in Recall@1 and +24% in NDCG@10), and our systematic ablation studies reveal that (i) both stages of finetuning are crucial and yield improved performance when combined, and (ii) contrastive alignment is effective across the target domains explored in our experiments.
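To make the "mixture of two contrastive losses and a language modeling loss" concrete, the following is a minimal sketch using a generic in-batch InfoNCE loss as a stand-in for each contrastive term. The exact loss definitions, the temperature, and the mixing weights are not given in this section, so `info_nce`, `mixed_objective`, `w_tt`, `w_ut`, and all default values are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors: List[Sequence[float]],
             positives: List[Sequence[float]],
             temperature: float = 0.1) -> float:
    """Generic in-batch InfoNCE: anchors[i] should match positives[i].

    All other positives in the batch act as negatives for anchors[i].
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity to every positive, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # cross-entropy with the diagonal as label
    return total / len(a)


def mixed_objective(l_nig: float, l_tt: float, l_ut: float,
                    w_tt: float = 0.1, w_ut: float = 0.1) -> float:
    """Weighted sum of the generation loss and two contrastive losses.

    Assumption: a simple linear mix; the weights are hypothetical.
    """
    return l_nig + w_tt * l_tt + w_ut * l_ut


if __name__ == "__main__":
    user_emb = [[1.0, 0.0], [0.0, 1.0]]    # toy user-history embeddings
    target_emb = [[1.0, 0.0], [0.0, 1.0]]  # toy target-item embeddings
    l_ut = info_nce(user_emb, target_emb)
    print(mixed_objective(l_nig=2.3, l_tt=0.0, l_ut=l_ut))
```

In a real two-tower setup the embeddings would come from the LLM itself and the language modeling term would be the token-level cross-entropy on the target item text; here both are replaced by plain numbers so that the mixing logic can be read in isolation.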
