Item-Language Model for Conversational Recommendation

Li Yang, Anushya Subbiah, Hardik Patel, Judith Yue Li, Yanwei Song, Reza Mirghaderi, Vikram Aggarwal, Qifan Wang

TL;DR

The paper tackles how to fuse user interaction signals with language models for conversational recommendation without finetuning the backbone LLM. It introduces the Item-Language Model (ILM), a two-phase approach in which a Q-Former item encoder translates collaborative-filtering embeddings into text-aligned item representations, which are then integrated with a frozen LLM via a projection adaptor. Phase 1 trains the encoder with item-text contrastive, item-grounded text generation, and item-text matching losses, plus a novel item-item contrastive loss that enriches the representations; Phase 2 keeps the LLM frozen and trains only the encoder and adaptor on multitask conversational recommendation. Across the ELM 24 and OpenP5 benchmarks, ILM consistently outperforms baselines, which highlights the importance of both language alignment and interaction signals in shaping item representations while preserving pretrained language capabilities. The framework enables strong, scalable conversational recommendation with reduced privacy risk and supports multi-turn tool use in dialogue systems.
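
The integration step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the `<item>` placeholder token, the toy embedding sizes, the single vector per item, and the plain linear map standing in for the projection adaptor are all assumptions made for illustration.

```python
from typing import Dict, List

Vector = List[float]

PLACEHOLDER = "<item>"  # hypothetical placeholder token, not taken from the paper


def project(vec: Vector, weight: List[Vector]) -> Vector:
    """Linear map standing in for the projection adaptor: returns weight @ vec."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def interleave(
    tokens: List[str],
    text_emb: Dict[str, Vector],
    item_reprs: List[Vector],
    weight: List[Vector],
) -> List[Vector]:
    """Build the input sequence for a frozen LLM.

    Each placeholder is replaced, in order, by the projected item
    representation; every other token uses its ordinary text embedding.
    """
    if tokens.count(PLACEHOLDER) != len(item_reprs):
        raise ValueError("number of placeholders must match number of items")
    items = iter(item_reprs)
    sequence = []
    for tok in tokens:
        if tok == PLACEHOLDER:
            sequence.append(project(next(items), weight))
        else:
            sequence.append(text_emb[tok])
    return sequence


# Toy example: 2-dimensional item representations, 3-dimensional LLM embeddings.
tokens = ["recommend", "after", PLACEHOLDER, PLACEHOLDER]
text_emb = {"recommend": [1.0, 0.0, 0.0], "after": [0.0, 1.0, 0.0]}
item_reprs = [[1.0, 2.0], [3.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # maps 2 dims to 3 dims

sequence = interleave(tokens, text_emb, item_reprs, weight)
```

In this sketch only `weight` (and, upstream, the encoder producing `item_reprs`) would be trainable; the text embeddings belong to the frozen LLM and are looked up unchanged.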

Abstract

Large language models (LLMs) have been extremely successful at tasks like complex dialogue understanding, reasoning and coding due to their emergent abilities. These emergent abilities have been extended with multi-modality to include image, audio, and video capabilities. Recommender systems, on the other hand, have been critical for information seeking and item discovery needs. Recently, there have been attempts to apply LLMs for recommendations. One difficulty of current attempts is that the underlying LLM is usually not trained on the recommender system data, which largely contains user interaction signals and is often not publicly available. Another difficulty is that user interaction signals often have a different pattern from natural language text, and it is currently unclear whether the LLM training setup can learn more non-trivial knowledge from interaction signals than traditional recommender system methods can. Finally, it is difficult to train multiple LLMs for different use-cases, and to retain the original language and reasoning abilities when learning from recommender system data. To address these three limitations, we propose an Item-Language Model (ILM), which is composed of an item encoder that produces text-aligned item representations encoding user interaction signals, and a frozen LLM that can understand those item representations with its pretrained knowledge preserved. We conduct extensive experiments that demonstrate the importance of both language alignment and user interaction knowledge in the item encoder.

Paper Structure (23 sections, 4 equations, 3 figures, 9 tables)

Figures (3)

  • Figure 1: Conversational recommendation tasks using ILM. User and item collaborative filtering embeddings, marked by placeholders in the input, are interleaved with text embeddings and fed to the model. Here, {history} denotes a sequence of items.
  • Figure 2: Overall model architecture for ILM. (a) The original item-text contrastive, item-grounded text generation and item-text matching losses used in BLIP-2 for Q-Former phase 1 training. (b) The new item-item contrastive loss we introduced in Q-Former phase 1 training. For user-item contrastive learning, we simply replace the item collaborative filtering (CF) embedding with the user CF embedding. (c) A schematic of how item-item contrastive learning can improve text-aligned item representations. (d) ILM phase 2 training, which integrates the Q-Former into a frozen LLM.
  • Figure 3: Effects of Number of Query Tokens.
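
The item-item contrastive objective mentioned in the Figure 2 caption can be sketched as a generic in-batch InfoNCE loss. The exact formulation used in the paper is not given in this excerpt, so the cosine similarity, the temperature value, the one-directional form, and the function names below are assumptions for illustration only.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors: List[Vector], positives: List[Vector],
             temperature: float = 0.1) -> float:
    """In-batch contrastive loss over paired item representations.

    positives[i] is the item paired with anchors[i]; every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true pair.
        total += log_denom - logits[i]
    return total / len(a)


# Toy batch: correctly paired items should score a much lower loss
# than the same items with their pairings swapped.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this shape pulls paired item representations together and pushes unpaired ones apart; per the caption, the user-item variant would use a user CF embedding in place of one of the item embeddings.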