Intent Representation Learning with Large Language Model for Recommendation
Yu Wang, Lei Sang, Yi Zhang, Yiwen Zhang
TL;DR
This work tackles multimodal intent modeling for recommender systems by introducing IRLLRec, a model-agnostic framework that uses LLMs to construct textual intents and a dual-tower architecture to fuse textual and interaction-based intents. It couples pairwise and translation alignment to bridge representation gaps and employs momentum distillation for Interaction-Text Matching to robustly align multimodal intents. Across three real-world datasets, IRLLRec shows consistent performance gains over strong baselines and demonstrates particular strength under data sparsity, with ablations confirming the central roles of intent alignment and matching. The approach advances interpretability and robustness in recommendations by explicitly modeling fine-grained latent intents from both textual descriptions and user-item interactions.
Abstract
Intent-based recommender systems have garnered significant attention for uncovering latent fine-grained preferences. Intents, as underlying factors of interactions, are crucial for improving recommendation interpretability. Most methods define intents as learnable parameters updated alongside interactions. However, existing frameworks often overlook textual information (e.g., user reviews, item descriptions), which is crucial for alleviating the sparsity of interaction intents. Exploring these multimodal intents, especially the inherent differences in their representation spaces, poses two key challenges: i) how to align multimodal intents and effectively mitigate noise; ii) how to extract and match latent key intents across modalities. To tackle these challenges, we propose a model-agnostic framework, Intent Representation Learning with Large Language Model (IRLLRec), which leverages large language models (LLMs) to construct multimodal intents and enhance recommendations. Specifically, IRLLRec employs a dual-tower architecture to learn multimodal intent representations. Next, we propose pairwise and translation alignment to eliminate inter-modal differences and enhance robustness against noisy input features. Finally, to better match textual and interaction-based intents, we employ momentum distillation to perform teacher-student learning on fused intent representations. Empirical evaluations on three datasets show that our IRLLRec framework outperforms baselines. Code is available at https://github.com/wangyu0627/IRLLRec.
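
The abstract names pairwise alignment and momentum distillation without giving their formulas, so the following is only an illustrative sketch of the general ideas, not the IRLLRec implementation. It assumes an InfoNCE-style contrastive loss for pairwise alignment and an exponential-moving-average (EMA) teacher for momentum distillation; the function names, the temperature, and the momentum value are all hypothetical.

```python
import math


def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter toward the student by an exponential
    moving average (assumed form of the momentum teacher update)."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    # Guard against the zero vector to avoid division by zero.
    norm = math.sqrt(_dot(v, v)) or 1.0
    return [x / norm for x in v]


def pairwise_alignment_loss(text_intents, inter_intents, temperature=0.2):
    """InfoNCE-style loss (an assumption, not the paper's stated objective).

    Row i of text_intents and row i of inter_intents form a positive pair;
    every other row of inter_intents acts as a negative for that row.
    """
    text = [_normalize(v) for v in text_intents]
    inter = [_normalize(v) for v in inter_intents]
    total = 0.0
    for i, t in enumerate(text):
        logits = [_dot(t, u) / temperature for u in inter]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_denom - logits[i]
    return total / len(text)
```

Under these assumptions, the loss is small when each textual intent is closest to its own interaction-based intent and grows when the pairs are mismatched, while a momentum close to 1 keeps the teacher changing slowly relative to the student.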
