Bridging Collaborative Filtering and Large Language Models with Dynamic Alignment, Multimodal Fusion and Evidence-grounded Explanations

Bo Ma, LuYao Liu, Simon Lau, Chandler Yuan, XueY Cui, and Rosie Zhang

TL;DR

This work tackles three practical challenges in LLM-assisted recommendation: evolving user preferences, multimodal item content, and trustworthy explanations. It introduces DynMM-Explain-LLMRec, which combines a lightweight online adapter for dynamic alignment, a shared multimodal representation for visual and audio content, and an evidence-grounded explanation generator, all while keeping the base models frozen. The training objective combines ranking, alignment, reconstruction, stability, and faithfulness terms with cross-modal contrastive losses. Experiments on multiple datasets show consistent improvements over strong baselines with modest latency overhead, which the authors present as evidence that the framework is viable for real-time, multimodal recommendation with transparent rationales in streaming, cold-start, and diverse-content settings.

Abstract

Recent work applies Large Language Models (LLMs) to recommendation by converting user interaction histories and item metadata into text prompts and having the LLM produce rankings or recommendations. A promising approach connects collaborative filtering (CF) knowledge to LLM representations through compact adapter networks, which avoids expensive fine-tuning while preserving the strengths of both components. Yet several challenges persist in practice: collaborative filtering models often rely on static snapshots that miss rapidly changing user preferences; many real-world items carry rich visual and audio content beyond textual descriptions; and current systems struggle to provide trustworthy explanations backed by concrete evidence. We introduce DynMM-Explain-LLMRec, a framework that addresses these limitations with three components. First, an online adaptation mechanism continuously incorporates new user interactions through lightweight modules, avoiding the need to retrain large models. Second, a unified representation combines collaborative signals with visual and audio features and handles cases where some modalities are unavailable. Third, an explanation system grounds recommendations in specific collaborative patterns and item attributes, producing natural-language rationales that users can verify. The approach keeps the base models frozen and adds minimal computational overhead, making it practical for real-world deployment.
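
To make two ideas from the abstract concrete, the following is a minimal sketch and not the paper's implementation. It assumes that modality embeddings share one dimension, that a missing modality is passed as `None` and skipped by simple averaging, and that the online adapter is a residual offset trained with a squared-error step. The names `fuse_modalities` and `OnlineAdapter`, the averaging rule, and the update rule are all hypothetical choices made for illustration.

```python
from typing import List, Optional

Vector = List[float]


def fuse_modalities(cf: Vector,
                    visual: Optional[Vector] = None,
                    audio: Optional[Vector] = None) -> Vector:
    """Average the embeddings that are present; a missing modality (None) is skipped.

    Assumption: all embeddings were already projected to one shared dimension.
    """
    present = [v for v in (cf, visual, audio) if v is not None]
    dim = len(cf)
    if any(len(v) != dim for v in present):
        raise ValueError("all modality embeddings must share one dimension")
    return [sum(v[i] for v in present) / len(present) for i in range(dim)]


class OnlineAdapter:
    """Hypothetical residual adapter: a small offset added to a frozen embedding."""

    def __init__(self, dim: int, lr: float = 0.1) -> None:
        self.delta = [0.0] * dim  # the only trainable state
        self.lr = lr

    def apply(self, base: Vector) -> Vector:
        # The frozen base embedding is read but never modified.
        return [b + d for b, d in zip(base, self.delta)]

    def update(self, base: Vector, target: Vector) -> None:
        # One gradient step on 0.5 * ||base + delta - target||^2 with respect to delta.
        adapted = self.apply(base)
        self.delta = [d - self.lr * (a - t)
                      for d, a, t in zip(self.delta, adapted, target)]


if __name__ == "__main__":
    item = fuse_modalities([1.0, 3.0], visual=[3.0, 5.0])  # audio unavailable
    adapter = OnlineAdapter(dim=2, lr=0.5)
    adapter.update(item, target=[2.0, 0.0])  # absorb one fresh interaction signal
    print(item, adapter.apply(item))
```

The point of the sketch is the division of labor: the fused embedding comes from components that stay fixed, while only the small `delta` vector changes as new interactions arrive, which is what keeps per-update cost low.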

Paper Structure

This paper contains 20 sections, 5 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: DynMM-Explain-LLMRec overview: a frozen base aligner fuses CF and text; a tiny online adapter absorbs fresh interactions; a multimodal branch adds vision/audio; soft tokens prompt a frozen LLM; evidence tokens (not shown) ground explanations.
  • Figure 2: Evidence-grounded explanation example.
  • Figure 3: Prompt template sketch used in inference.