Enhancing Multimodal Recommendations with Vision-Language Models and Information-Aware Fusion

Hai-Dang Kieu, Min Xu, Thanh Trung Huynh, Dung D. Le

TL;DR

VIRAL tackles the fusion of visual and textual content in multimodal recommendation with two components: a VLM-guided visual enrichment module that generates fine-grained, title-aware image descriptions, and an information-aware fusion mechanism grounded in Partial Information Decomposition. The model combines graph-based multimodal interaction, a cross-modal Transformer that captures synergy, and Transformer-based redundancy estimation that isolates unique visual information; it is trained with InfoNCE-based losses alongside a BPR objective. On three Amazon datasets, VIRAL consistently outperforms strong baselines and strengthens the contribution of the visual modality, and ablations confirm that both VLM enrichment and information-aware fusion are necessary. The work enhances the interpretability and robustness of multimodal recommendation and points to joint VLM-recommender training for task-adaptive vision–language understanding as future work.
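The training objective mentioned above pairs a BPR ranking loss with InfoNCE-style contrastive terms. The following is a minimal, dependency-free sketch of those two generic losses, not the authors' implementation: the function names, the in-batch negative scheme, the `temperature` default, and the `weight` that balances the two terms are all assumptions made for illustration, and this summary does not specify which representations VIRAL contrasts.

```python
import math


def bpr_loss(pos_scores, neg_scores):
    """Bayesian Personalized Ranking loss: mean of -log(sigmoid(pos - neg))."""
    total = 0.0
    for p, n in zip(pos_scores, neg_scores):
        # -log(sigmoid(p - n)) equals softplus(n - p); computed in a stable form.
        x = n - p
        total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / len(pos_scores)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.2):
    """InfoNCE with in-batch negatives: positives[i] is the match for anchors[i]."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


def total_loss(pos_scores, neg_scores, anchors, positives, weight=0.1):
    """Hypothetical combined objective: BPR plus a weighted contrastive term."""
    return bpr_loss(pos_scores, neg_scores) + weight * info_nce(anchors, positives)
```

In this sketch, `pos_scores` and `neg_scores` would be user-item scores for interacted and sampled non-interacted items, while `anchors` and `positives` would be paired embeddings (for example, two views of the same item). Both losses shrink as the positive is scored higher than the alternatives.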

Abstract

Recent advances in multimodal recommendation (MMR) highlight the potential of integrating visual and textual content to enrich item representations. However, existing methods often rely on coarse visual features and naive fusion strategies, resulting in redundant or misaligned representations. From an information-theoretic perspective, effective fusion should balance unique, shared, and redundant modality information to preserve complementary cues. To this end, we propose VIRAL, a novel Vision-Language and Information-aware Recommendation framework that enhances multimodal fusion through two components: (i) a VLM-based visual enrichment module that generates fine-grained, title-guided descriptions for semantically aligned image representations, and (ii) an information-aware fusion module inspired by Partial Information Decomposition (PID) to disentangle and integrate complementary signals. Experiments on three Amazon datasets show that VIRAL consistently outperforms strong multimodal baselines and substantially improves the contribution of visual features.

Paper Structure

This paper contains 11 sections, 17 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: [Top] Our VIRAL outperforms recent SOTA multimodal models. [Bottom] Using a VLM to generate a visual description of an item.
  • Figure 2: Our VIRAL pipeline
  • Figure 3: VLM enriches visual features
  • Figure 4: Experiments on Baby. [Top] Item embedding visualization. [Bottom] Performance of VLM models.