CLEAR: Null-Space Projection for Cross-Modal De-Redundancy in Multimodal Recommendation

Hao Zhan; Yihui Wang; Yonghui Yang; Danyang Yue; Yu Wang; Pengyang Shao; Fei Shen; Fei Liu; Le Wu

TL;DR

CLEAR is a lightweight and plug-and-play cross-modal de-redundancy approach for multimodal recommendation that reshapes the multimodal representation space by suppressing redundant cross-modal components while preserving modality-specific information.

Abstract

Multimodal recommendation has emerged as an effective paradigm for enhancing collaborative filtering by incorporating heterogeneous content modalities. Existing multimodal recommenders predominantly focus on reinforcing cross-modal consistency to facilitate multimodal fusion. However, we observe that multimodal representations often exhibit substantial cross-modal redundancy, where dominant shared components overlap across modalities. Such redundancy can limit the effective utilization of complementary information, explaining why incorporating additional modalities does not always yield performance improvements. In this work, we propose CLEAR, a lightweight and plug-and-play cross-modal de-redundancy approach for multimodal recommendation. Rather than enforcing stronger cross-modal alignment, CLEAR explicitly characterizes the redundant shared subspace across modalities by modeling cross-modal covariance between visual and textual representations. By identifying dominant shared directions via singular value decomposition and projecting multimodal features onto the complementary null space, CLEAR reshapes the multimodal representation space by suppressing redundant cross-modal components while preserving modality-specific information. This subspace-level projection implicitly regulates representation learning dynamics, preventing the model from repeatedly amplifying redundant shared semantics during training. Notably, CLEAR can be seamlessly integrated into existing multimodal recommenders without modifying their architectures or training objectives. Extensive experiments on three public benchmark datasets demonstrate that explicitly reducing cross-modal redundancy consistently improves recommendation performance across a wide range of multimodal recommendation models.
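
The pipeline described in the abstract (cross-modal covariance, SVD, projection onto the complementary null space) can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the function name `null_space_projectors`, the mean-centering step, the synthetic data, and the choice to apply the projection to raw feature matrices are all assumptions made for the example.

```python
import numpy as np

def null_space_projectors(V, T, k):
    """Hypothetical helper: projectors that remove the top-k shared directions.

    V: (n_items, d_v) visual features; T: (n_items, d_t) textual features.
    """
    # Assumption: center each modality so C is a cross-covariance matrix.
    Vc = V - V.mean(axis=0, keepdims=True)
    Tc = T - T.mean(axis=0, keepdims=True)
    C = Vc.T @ Tc / (V.shape[0] - 1)              # shape (d_v, d_t)

    # Singular vectors come in pairs: columns of U live in the visual space,
    # rows of Wt live in the textual space; S is sorted in descending order.
    U, S, Wt = np.linalg.svd(C, full_matrices=False)
    U_k, W_k = U[:, :k], Wt[:k, :].T

    # Orthogonal projectors onto the complement of the top-k directions.
    P_V = np.eye(V.shape[1]) - U_k @ U_k.T
    P_T = np.eye(T.shape[1]) - W_k @ W_k.T
    return P_V, P_T, S

# Synthetic data: two shared latent factors plus modality-specific noise.
rng = np.random.default_rng(0)
n, d_v, d_t, k = 500, 16, 12, 2
shared = rng.normal(size=(n, k))
V = 3 * shared @ rng.normal(size=(k, d_v)) + rng.normal(size=(n, d_v))
T = 3 * shared @ rng.normal(size=(k, d_t)) + rng.normal(size=(n, d_t))

P_V, P_T, S_before = null_space_projectors(V, T, k)
V_clean, T_clean = V @ P_V, T @ P_T               # projectors are symmetric

# Re-measure the cross-covariance spectrum after projection.
_, _, S_after = null_space_projectors(V_clean, T_clean, k)
print("before:", np.round(S_before[:4], 3))
print("after: ", np.round(S_after[:4], 3))
```

In this sketch, the projected features keep their original dimensionality, so they could in principle be passed to a downstream recommender unchanged; only the top-$k$ directions of the cross-covariance spectrum are removed, and the remaining singular values are left as they were.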

Paper Structure (45 sections, 8 theorems, 33 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Preliminaries
Problem Definition
Multimodal Recommendation Formulation
Methodology
Overview of CLEAR
Cross-Modal Redundancy Modeling
Characterizing Redundancy and Its Interference Mechanisms
Cross-Modal Covariance Construction
Localizing Redundancy via SVD
Null-Space Projection for De-redundancy
Theoretical Foundation: Low-Rank Redundancy Hypothesis
Optimization and Training
Experiments
Experimental Settings
...and 30 more sections

Key Result

Theorem 1

Decompose features as $\mathbf{v} = \mathbf{v}_{\parallel} + \mathbf{v}_{\perp}$, where $\mathbf{v}_{\parallel} = \mathbf{U}_k\mathbf{U}_k^\top\mathbf{v}$ lies in the redundant subspace. For items from different classes $c_i \neq c_j$, the inter-class separation measured by Fisher's Linear Discriminant satisfies [...], where $\text{FDR}_{\perp}$ is computed analogously for $\mathbf{v}_{\perp}$.
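
As a quick check on this decomposition, assume $\mathbf{U}_k$ has orthonormal columns, as the top-$k$ singular vectors of an SVD would. Then $\mathbf{U}_k^\top\mathbf{U}_k = \mathbf{I}_k$, the two components are orthogonal, and their squared norms add:

$$
\mathbf{v}_{\perp} = (\mathbf{I} - \mathbf{U}_k\mathbf{U}_k^\top)\mathbf{v}, \qquad
\mathbf{v}_{\parallel}^\top\mathbf{v}_{\perp} = \mathbf{v}^\top\mathbf{U}_k\mathbf{U}_k^\top(\mathbf{I} - \mathbf{U}_k\mathbf{U}_k^\top)\mathbf{v} = 0, \qquad
\|\mathbf{v}\|^2 = \|\mathbf{v}_{\parallel}\|^2 + \|\mathbf{v}_{\perp}\|^2 .
$$

The middle identity holds because $\mathbf{U}_k\mathbf{U}_k^\top\mathbf{U}_k\mathbf{U}_k^\top = \mathbf{U}_k\mathbf{U}_k^\top$, so the projector onto the redundant subspace annihilates its own complement.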

Figures (6)

Figure 1: Illustration of cross-modal redundancy. Left: density distributions of similarity scores for items retrieved by visual and textual modalities. Right: overlap ratios between the top-K retrieved items across modalities.
Figure 2: The overall architecture of CLEAR. Visual and textual features are encoded, then a cross-modal covariance matrix $\mathbf{C}$ is constructed and decomposed via SVD to identify top-$k$ redundant directions. Null-space projections $\mathbf{P}_V$ and $\mathbf{P}_T$ suppress redundancy while preserving modality-specific information for downstream graph-based recommendation.
Figure 3: Ablation study comparing variants on three datasets. The bar chart shows R@20 and N@20 metrics for: Full Model (with null-space projection and fixed top-$k$), w/o null-space, and Dynamic Ratio (adaptive ratio-based selection).
Figure 4: Hyperparameter analysis on Clothing and Sports datasets. Heatmaps show R@20 performance across different $\lambda$ and $k$ combinations. Darker colors indicate better results.
Figure 5: Singular value distribution of cross-modal covariance matrix before and after null-space projection.
...and 1 more figure

Theorems & Definitions (8)

Theorem 1: Discriminability Degradation
Proposition 1: Redundancy Amplification in Fusion
Corollary 1: Imbalance Ratio
Theorem 2: Fast Convergence on Redundant Subspace
Theorem 3: Singular Value Suppression
Theorem 4: SVD Preferentially Captures Coarse-Grained Redundancy
Corollary 2: Task Irrelevance of Coarse Information
Theorem 5: Soft Projection as Risk Minimization
