Intent Representation Learning with Large Language Model for Recommendation

Yu Wang, Lei Sang, Yi Zhang, Yiwen Zhang

TL;DR

This work tackles multimodal intent modeling for recommender systems by introducing IRLLRec, a model-agnostic framework that uses LLMs to construct textual intents and a dual-tower architecture to fuse textual and interaction-based intents. It couples pairwise and translation alignment to bridge representation gaps and employs momentum distillation for Interaction-Text Matching to robustly align multimodal intents. Across three real-world datasets, IRLLRec shows consistent performance gains over strong baselines and demonstrates particular strength under data sparsity, with ablations confirming the central roles of intent alignment and matching. The approach advances interpretability and robustness in recommendations by explicitly modeling fine-grained latent intents from both textual descriptions and user-item interactions.

Abstract

Intent-based recommender systems have garnered significant attention for uncovering latent fine-grained preferences. Intents, as underlying factors of interactions, are crucial for improving recommendation interpretability. Most methods define intents as learnable parameters updated alongside interactions. However, existing frameworks often overlook textual information (e.g., user reviews, item descriptions), which is crucial for alleviating the sparsity of interaction intents. Exploring these multimodal intents, especially the inherent differences in representation spaces, poses two key challenges: i) how to align multimodal intents and effectively mitigate noise issues; ii) how to extract and match latent key intents across modalities. To tackle these challenges, we propose a model-agnostic framework, Intent Representation Learning with Large Language Model (IRLLRec), which leverages large language models (LLMs) to construct multimodal intents and enhance recommendations. Specifically, IRLLRec employs a dual-tower architecture to learn multimodal intent representations. Next, we propose pairwise and translation alignment to eliminate inter-modal differences and enhance robustness against noisy input features. Finally, to better match textual and interaction-based intents, we employ momentum distillation to perform teacher-student learning on fused intent representations. Empirical evaluations on three datasets show that our IRLLRec framework outperforms baselines. Code is available at https://github.com/wangyu0627/IRLLRec.
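
The teacher-student step mentioned in the abstract can be illustrated with a generic momentum distillation sketch. This is not the authors' implementation: it assumes the common setup in which the teacher is an exponential moving average (EMA) of the student and its soft predictions are mixed with the hard matching labels. The function names, the `momentum` and `alpha` values, and the square logits layout (row i is the true match for column i) are all hypothetical choices made for illustration.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.995):
    """Move each teacher parameter a small step toward the student (assumed EMA rule)."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distill_loss(student_logits, teacher_logits, alpha=0.4):
    """Cross-entropy of the student against mixed hard and soft targets.

    Hard targets are the diagonal (sample i matches sample i); soft targets
    are the teacher's softmax. alpha weights the soft part (hypothetical value).
    """
    n = student_logits.shape[0]
    hard = np.eye(n)
    soft = np.exp(log_softmax(teacher_logits))
    targets = (1.0 - alpha) * hard + alpha * soft
    return float(-(targets * log_softmax(student_logits)).sum(axis=1).mean())

# Usage: one illustrative step with random matching scores.
rng = np.random.default_rng(0)
student = {"w": rng.normal(size=(4, 4))}
teacher = {"w": student["w"].copy()}
loss = distill_loss(student["w"], teacher["w"])
teacher = ema_update(teacher, student)
```

In this kind of setup the teacher changes slowly, so its soft targets act as a smoothed supervision signal when some of the hard matches are noisy.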

Paper Structure

This paper contains 25 sections, 19 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: (a) User-item interaction bipartite graph; (b) Disentangled interactions incorporating user intents: $u_1$-$i_3$ is influenced by intents $c_2$ and $c_n$, reflecting a preference for businesses offering fine dining, convenience, and leisure; (c) Gaussian kernel density estimation (KDE \cite{2020kde}) visualizes three embedding types: interaction from the pre-trained LightGCN \cite{2020lightgcn}, profile from attribute summaries extracted by RLMRec \cite{2024rlmrec}, and intent from our chain-of-thought reasoning summaries (Figure \ref{fig:user_intent}); (d) The text represents user intents, with red for likes and green for dislikes, and lines indicating interaction or non-interaction.
  • Figure 2: Illustration of IRLLRec. Multi Intent Fusion (MIF): MIF takes textual and interaction-based intents as inputs, learning intent embeddings $\mathbf{z}$ and $\mathbf{r}$ through a dual-tower model and fusing them. Intent Alignment (IA): IA bridges spatial discrepancies by aligning two distinct representation spaces. Interaction-text Matching (ITM): ITM employs momentum distillation for teacher-student learning, enabling optimal matching of multimodal intents for users and items.
  • Figure 3: Performance comparison at different sparsity levels. The bar graph shows the number of users per group on the left y-axis, and the line graph shows the performance of each method w.r.t. NDCG@20 on the right y-axis.
  • Figure 4: Ablation studies of model variants on the Amazon book and movie datasets w.r.t. Recall@20 and NDCG@20.
  • Figure 5: Case study on Amazon movie.
  • ...and 3 more figures
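
The Intent Alignment (IA) component in the Figure 2 caption aligns two representation spaces. One plausible way to picture the pairwise part is a symmetric contrastive (InfoNCE-style) loss between the two towers' intent embeddings $\mathbf{z}$ and $\mathbf{r}$. The sketch below assumes that formulation; the caption does not specify the actual loss, and the function names and `temperature` value are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def pairwise_alignment_loss(z, r, temperature=0.2):
    """Symmetric InfoNCE-style loss (assumed form, not confirmed by the source).

    z, r: arrays of shape (batch, dim); row i of z and row i of r are assumed
    to describe the same user or item. Other rows in the batch act as negatives.
    """
    z, r = l2_normalize(z), l2_normalize(r)
    sim = z @ r.T / temperature
    idx = np.arange(sim.shape[0])
    loss_z_to_r = -log_softmax(sim, axis=1)[idx, idx].mean()
    loss_r_to_z = -log_softmax(sim, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_z_to_r + loss_r_to_z))

# Usage: matched embeddings should score a lower loss than mismatched ones.
z = np.eye(4)
aligned = pairwise_alignment_loss(z, np.eye(4))
shuffled = pairwise_alignment_loss(z, np.roll(np.eye(4), 1, axis=0))
```

Normalizing before the dot product keeps the similarity scale independent of embedding norms, which matters when the two towers produce vectors of very different magnitude.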