Towards Bridging the Cross-modal Semantic Gap for Multi-modal Recommendation

Xinglong Wu, Anfeng Huang, Hongwei Yang, Hui He, Yu Tai, Weizhe Zhang

TL;DR

The paper tackles the cross-modal semantic gap in multi-modal recommender systems (MMRS) by introducing CLIPER, a model-agnostic framework that uses multi-view prompts and CLIP-based cross-modal alignment to produce richer item representations. It extracts multiple semantic views from text fields, computes cross-modal similarities between those views and the item images, and fuses these signals through a self-attention-driven Fusion Layer whose output feeds downstream recommendation backbones. Across three real-world datasets and multiple baselines, CLIPER consistently improves Recall@K and NDCG@K, with notable gains for FREEDOM-CLIPER; longer textual inputs via Long-CLIP further boost performance. The work demonstrates that incorporating cross-modal alignment into MMRS is practical and effective, provides insights into view importance and fusion strategies, and offers a scalable, plug-and-play enhancement for existing systems.

Abstract

Multi-modal recommendation greatly enhances the performance of recommender systems by modeling the auxiliary information from multi-modality contents. Most existing multi-modal recommendation models primarily exploit multimedia information propagation processes to enrich item representations and directly utilize modal-specific embedding vectors independently obtained from upstream pre-trained models. However, this might be inappropriate since the abundant task-specific semantics remain unexplored, and the cross-modality semantic gap hinders the recommendation performance. Inspired by the recent progress of the cross-modal alignment model CLIP, in this paper, we propose a novel CLIP Enhanced Recommender (CLIPER) framework to bridge the semantic gap between modalities and extract fine-grained multi-view semantic information. Specifically, we introduce a multi-view modality-alignment approach for representation extraction and measure the semantic similarity between modalities. Furthermore, we integrate the multi-view multimedia representations into downstream recommendation models. Extensive experiments conducted on three public datasets demonstrate the consistent superiority of our model over state-of-the-art multi-modal recommendation models.
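
The two steps named above, measuring text-image similarity per view and fusing the multi-view representations, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Random vectors stand in for CLIP text and image embeddings, and the function names, the single-head attention, the mean pooling, and all shapes are assumptions made for illustration. How CLIPER actually feeds the similarity scores into its Fusion Layer is not stated here, so the sketch computes them separately.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def view_image_similarity(view_emb, image_emb):
    """Cosine similarity between each text view and the item image.

    view_emb:  (n_items, n_views, d), one embedding per text view
    image_emb: (n_items, d)
    returns:   (n_items, n_views)
    """
    return np.einsum("nvd,nd->nv", l2_normalize(view_emb), l2_normalize(image_emb))

def self_attention_fuse(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, then mean pooling.

    tokens:  (n_items, n_tokens, d)
    returns: (n_items, d_out), one fused vector per item
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return (softmax(scores, axis=-1) @ v).mean(axis=1)

# Hypothetical sizes; random placeholders instead of real CLIP outputs.
rng = np.random.default_rng(0)
n_items, n_views, d = 4, 3, 8
view_emb = rng.normal(size=(n_items, n_views, d))
image_emb = rng.normal(size=(n_items, d))

sim = view_image_similarity(view_emb, image_emb)

# Treat each text view and the image as one token per item (an assumption).
tokens = np.concatenate([view_emb, image_emb[:, None, :]], axis=1)
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
item_repr = self_attention_fuse(tokens, w_q, w_k, w_v)
```

In a plug-and-play setting, a vector like `item_repr` would take the place of the raw modality features that a downstream backbone normally consumes. The projection matrices would be learned jointly with the recommender rather than sampled at random.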

Paper Structure

This paper contains 30 sections, 3 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Schematic illustration of the workflow of our proposed CLIPER.
  • Figure 2: Impact of Individual Views.
  • Figure 3: Impact of Fusion Layer.
  • Figure 4: Impact of Embedding Size $d$.
  • Figure 5: Visualization.
  • ...and 1 more figure