FLARE: Fusing Language Models and Collaborative Architectures for Recommender Enhancement
Liam Hebert, Marialena Kyriakidi, Hubert Pham, Krishna Sayana, James Pine, Sukhdeep Sodhi, Ambarish Jash
TL;DR
Problem: reported gains of text-based recommenders over item-ID baselines such as Bert4Rec can be optimistic unless those baselines are carefully tuned; text metadata shows promise but requires a fair comparison. Approach: Flare combines item-ID embeddings with a frozen text encoder, using a Perceiver to fuse textual item context with collaborative signals; it is trained with an MLM objective and a contrastive loss, and adds a critiquing modality to steer predictions. Contributions: re-tuned, stronger Bert4Rec baselines; competitive performance of Flare on standard and large-vocabulary datasets; critiquing as an evaluation method; and ablations validating the role of each component. Significance: demonstrates a scalable, critique-enabled hybrid framework for using textual context in web-scale recommendation, with implications for practical deployment and for future research on vocabulary size and in-domain LM tuning.
Abstract
Recent proposals in recommender systems represent items by their textual descriptions, encoded with a large language model. They report better results on standard benchmarks than item ID-only models such as Bert4Rec. In this work, we revisit the often-used Bert4Rec baseline and show that, with further tuning, Bert4Rec significantly outperforms previously reported numbers and, on some datasets, is competitive with state-of-the-art models. With revised baselines for item ID-only models, this paper also establishes new competitive results for architectures that combine IDs and textual descriptions. We demonstrate this with Flare (Fusing Language models and collaborative Architectures for Recommender Enhancement). Flare is a novel hybrid sequence recommender that integrates a language model with a collaborative filtering model using a Perceiver network. Prior studies evaluate on datasets with limited corpus sizes, but many commercially applicable recommender systems common on the web must handle larger corpora. We evaluate Flare on a more realistic dataset with a significantly larger item vocabulary, introducing new baselines for this setting. This paper also showcases Flare's inherent ability to support critiquing, enabling users to provide feedback and refine recommendations. We leverage critiquing as an evaluation method to assess the model's language understanding and its transferability to the recommendation task.
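To make the fusion idea concrete, the following is a minimal sketch of how a Perceiver-style cross-attention step could compress the token embeddings of an item description into a few latent vectors and combine them with the item's ID embedding. It is an illustration under stated assumptions, not the Flare implementation: the single-head attention without learned projections, the mean pooling, the additive combination, and all names (`cross_attend`, `fuse_item`, the toy dimensions) are hypothetical, and random vectors stand in for the frozen text encoder output and the learned latents.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(latents, tokens):
    """Each latent query attends over all text-token embeddings.

    Assumption: single head, no learned key/value projections.
    Returns one summary vector per latent.
    """
    scale = 1.0 / math.sqrt(len(latents[0]))
    summaries = []
    for query in latents:
        weights = softmax([dot(query, tok) * scale for tok in tokens])
        summaries.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens))
             for d in range(len(query))]
        )
    return summaries


def fuse_item(id_embedding, text_token_embeddings, latents):
    """Hypothetical fusion: mean-pool the latent summaries, add to the ID embedding."""
    summaries = cross_attend(latents, text_token_embeddings)
    pooled = [sum(column) / len(summaries) for column in zip(*summaries)]
    return [i + p for i, p in zip(id_embedding, pooled)]


rng = random.Random(0)
dim, num_tokens, num_latents = 8, 5, 2


def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


id_emb = rand_vec()                                   # collaborative (item-ID) embedding
text_tokens = [rand_vec() for _ in range(num_tokens)]  # stand-in for frozen text encoder output
latents = [rand_vec() for _ in range(num_latents)]     # stand-in for learned latent queries

fused = fuse_item(id_emb, text_tokens, latents)
print(len(fused))  # 8
```

The point of the latent bottleneck in this sketch is that the cost of summarizing a description depends on the small, fixed number of latents rather than on the description length, so the fused item vector has the same size regardless of how much text an item carries.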
