CLEAR: Null-Space Projection for Cross-Modal De-Redundancy in Multimodal Recommendation

Hao Zhan, Yihui Wang, Yonghui Yang, Danyang Yue, Yu Wang, Pengyang Shao, Fei Shen, Fei Liu, Le Wu

TL;DR

CLEAR is a lightweight and plug-and-play cross-modal de-redundancy approach for multimodal recommendation that reshapes the multimodal representation space by suppressing redundant cross-modal components while preserving modality-specific information.

Abstract

Multimodal recommendation has emerged as an effective paradigm for enhancing collaborative filtering by incorporating heterogeneous content modalities. Existing multimodal recommenders predominantly focus on reinforcing cross-modal consistency to facilitate multimodal fusion. However, we observe that multimodal representations often exhibit substantial cross-modal redundancy, where dominant shared components overlap across modalities. Such redundancy can limit the effective utilization of complementary information, explaining why incorporating additional modalities does not always yield performance improvements. In this work, we propose CLEAR, a lightweight and plug-and-play cross-modal de-redundancy approach for multimodal recommendation. Rather than enforcing stronger cross-modal alignment, CLEAR explicitly characterizes the redundant shared subspace across modalities by modeling cross-modal covariance between visual and textual representations. By identifying dominant shared directions via singular value decomposition and projecting multimodal features onto the complementary null space, CLEAR reshapes the multimodal representation space by suppressing redundant cross-modal components while preserving modality-specific information. This subspace-level projection implicitly regulates representation learning dynamics, preventing the model from repeatedly amplifying redundant shared semantics during training. Notably, CLEAR can be seamlessly integrated into existing multimodal recommenders without modifying their architectures or training objectives. Extensive experiments on three public benchmark datasets demonstrate that explicitly reducing cross-modal redundancy consistently improves recommendation performance across a wide range of multimodal recommendation models.
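The pipeline the abstract describes (cross-modal covariance, SVD, projection onto the complement of the dominant shared directions) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `null_space_projectors`, the mean-centering step, the hard top-$k$ cutoff, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np


def null_space_projectors(visual, textual, k):
    """Build projectors that remove the top-k shared directions (illustrative).

    visual:  (n_items, d_v) array of visual item features
    textual: (n_items, d_t) array of textual item features
    """
    # Assumption: center each modality so the product below is a covariance.
    v = visual - visual.mean(axis=0, keepdims=True)
    t = textual - textual.mean(axis=0, keepdims=True)

    # Cross-modal covariance, shape (d_v, d_t).
    cov = v.T @ t / (v.shape[0] - 1)

    # Left/right singular vectors give the shared directions in each modality.
    u, s, wt = np.linalg.svd(cov, full_matrices=False)
    u_k = u[:, :k]       # (d_v, k) dominant visual-side directions
    w_k = wt[:k, :].T    # (d_t, k) dominant textual-side directions

    # Projectors onto the orthogonal complement of the top-k directions.
    p_v = np.eye(v.shape[1]) - u_k @ u_k.T
    p_t = np.eye(t.shape[1]) - w_k @ w_k.T
    return p_v, p_t, s


# Synthetic data: four latent factors shared by both modalities, plus noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 4))
visual = shared @ rng.normal(size=(4, 32)) + 0.1 * rng.normal(size=(500, 32))
textual = shared @ rng.normal(size=(4, 24)) + 0.1 * rng.normal(size=(500, 24))

p_v, p_t, s_before = null_space_projectors(visual, textual, k=4)
visual_clean = visual @ p_v      # visual features with shared directions removed
textual_clean = textual @ p_t    # textual features with shared directions removed
```

In this sketch, the cross-covariance of the projected features has a leading singular value equal to the $(k+1)$-th singular value of the original cross-covariance, which is what suppressing the top-$k$ shared directions amounts to here.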

Paper Structure (45 sections, 8 theorems, 33 equations, 6 figures, 4 tables, 1 algorithm)

Key Result

Theorem 1

Decompose features as $\mathbf{v} = \mathbf{v}_{\parallel} + \mathbf{v}_{\perp}$, where $\mathbf{v}_{\parallel} = \mathbf{U}_k\mathbf{U}_k^\top\mathbf{v}$ lies in the redundant subspace. For items from different classes $c_i \neq c_j$, the inter-class separation measured by Fisher's Linear Discriminant satisfies …, where $\text{FDR}_{\perp}$ is computed analogously for $\mathbf{v}_{\perp}$.
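The decomposition used in Theorem 1 can be written out explicitly. The short derivation below assumes that $\mathbf{U}_k$ has orthonormal columns, as it would if its columns are singular vectors from the SVD mentioned in the abstract; it only spells out the two components and does not reproduce the theorem's bound.

```latex
\begin{aligned}
\mathbf{v}_{\perp} &= \mathbf{v} - \mathbf{v}_{\parallel}
  = (\mathbf{I} - \mathbf{U}_k\mathbf{U}_k^\top)\,\mathbf{v}, \\
\mathbf{v}_{\parallel}^\top \mathbf{v}_{\perp}
  &= \mathbf{v}^\top \mathbf{U}_k\mathbf{U}_k^\top
     (\mathbf{I} - \mathbf{U}_k\mathbf{U}_k^\top)\,\mathbf{v} = 0
  \quad \text{since } \mathbf{U}_k^\top\mathbf{U}_k = \mathbf{I}_k, \\
\|\mathbf{v}\|^2 &= \|\mathbf{v}_{\parallel}\|^2 + \|\mathbf{v}_{\perp}\|^2 .
\end{aligned}
```

Under that assumption the two components are orthogonal, so removing $\mathbf{v}_{\parallel}$ leaves $\mathbf{v}_{\perp}$ unchanged.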

Figures (6)

  • Figure 1: Illustration of cross-modal redundancy. Left: density distributions of similarity scores for items retrieved by visual and textual modalities. Right: overlap ratios between the top-K retrieved items across modalities.
  • Figure 2: The overall architecture of CLEAR. Visual and textual features are encoded, then a cross-modal covariance matrix $\mathbf{C}$ is constructed and decomposed via SVD to identify top-$k$ redundant directions. Null-space projections $\mathbf{P}_V$ and $\mathbf{P}_T$ suppress redundancy while preserving modality-specific information for downstream graph-based recommendation.
  • Figure 3: Ablation study comparing variants on three datasets. The bar chart shows R@20 and N@20 metrics for: Full Model (with null-space projection and fixed top-$k$), w/o null-space, and Dynamic Ratio (adaptive ratio-based selection).
  • Figure 4: Hyperparameter analysis on Clothing and Sports datasets. Heatmaps show R@20 performance across different $\lambda$ and $k$ combinations. Darker colors indicate better results.
  • Figure 5: Singular value distribution of cross-modal covariance matrix before and after null-space projection.
  • ...and 1 more figure
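The overlap ratio in the right panel of Figure 1 can be illustrated with a small sketch. The caption does not state the exact retrieval procedure, so the version below assumes cosine similarity over item features and defines the overlap as the mean fraction of top-$K$ neighbors shared by the two modalities; the helper names `topk_neighbors` and `overlap_ratio` are hypothetical.

```python
import numpy as np


def topk_neighbors(features, k):
    """Indices of each item's k most cosine-similar items (self excluded)."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)    # never retrieve the item itself
    return np.argsort(-sim, axis=1)[:, :k]


def overlap_ratio(visual, textual, k):
    """Mean fraction of top-k neighbors shared by the two modalities."""
    nv = topk_neighbors(visual, k)
    nt = topk_neighbors(textual, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nv, nt)]
    return float(np.mean(shared)) / k


# Usage on random stand-in features (assumed shapes: items x dimensions).
rng = np.random.default_rng(0)
visual = rng.normal(size=(200, 32))
textual = rng.normal(size=(200, 24))
ratio = overlap_ratio(visual, textual, k=10)
```

A ratio near 1 would mean the two modalities retrieve almost the same neighbors, which is the kind of redundancy the figure is meant to show.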

Theorems & Definitions (8)

  • Theorem 1: Discriminability Degradation
  • Proposition 1: Redundancy Amplification in Fusion
  • Corollary 1: Imbalance Ratio
  • Theorem 2: Fast Convergence on Redundant Subspace
  • Theorem 3: Singular Value Suppression
  • Theorem 4: SVD Preferentially Captures Coarse-Grained Redundancy
  • Corollary 2: Task Irrelevance of Coarse Information
  • Theorem 5: Soft Projection as Risk Minimization
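Theorem 5 refers to a soft projection, and Figure 4 varies a parameter $\lambda$ alongside $k$, but the listing above does not give the formula. One common way to soften a null-space projection, shown here purely as an assumption, is to shrink the top-$k$ directions by a factor $(1 - \lambda)$ instead of removing them: $\mathbf{P}(\lambda) = \mathbf{I} - \lambda\,\mathbf{U}_k\mathbf{U}_k^\top$. The function name `soft_projector` and this parameterization are hypothetical and may differ from the paper's definition.

```python
import numpy as np


def soft_projector(u_k, lam):
    """Hypothetical soft projector: scale the span of u_k by (1 - lam).

    u_k: (d, k) matrix with orthonormal columns
    lam: strength in [0, 1]; 0 leaves features unchanged, 1 is a hard projection
    """
    d = u_k.shape[0]
    return np.eye(d) - lam * (u_k @ u_k.T)


rng = np.random.default_rng(0)
# Orthonormal basis of R^8; the first two columns stand in for redundant directions.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
u_k, rest = q[:, :2], q[:, 2:]

p_soft = soft_projector(u_k, lam=0.7)
shrunk = p_soft @ u_k     # components along u_k are scaled by 1 - 0.7
kept = p_soft @ rest      # orthogonal directions pass through unchanged
```

Under this assumed form, $\lambda$ trades off how much of the shared component is kept, while directions orthogonal to $\mathbf{U}_k$ are never altered.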