It is Never Too Late to Mend: Separate Learning for Multimedia Recommendation

Zhuangzhuang He; Zihan Wang; Yonghui Yang; Haoyue Bai; Le Wu

TL;DR

This work addresses the performance plateau that multimedia recommendation reaches when all modalities are fully aligned via self-supervised learning. It introduces Separate Learning (SEA), an information-theoretic framework that splits each modality's representation into modal-unique and modal-generic parts and optimizes them with two mutual information (MI) objectives: minimizing an upper bound on the MI between the generic and unique parts of the same modality to learn modal-unique features, and maximizing a lower bound on the MI between the generic parts of different modalities to strengthen modal-generic features. SEA uses GNNs over heterogeneous user-item graphs and homogeneous item-item graphs to learn modality-aware representations, then fuses them and optimizes recommendation with a BPR objective. Experiments on three datasets show that SEA consistently outperforms strong baselines, and ablation and sensitivity analyses support the necessity and complementarity of its components. The approach offers a flexible, generalizable way to separate modality-specific from modality-agnostic information in multimodal recommendation while preserving modality-specific attributes.
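To make the final optimization step concrete, the following is a minimal NumPy sketch of a BPR (Bayesian Personalized Ranking) loss over a batch of (user, positive item, negative item) triples. It assumes the fused user and item embeddings are already available and that scores are dot products; the function name `bpr_loss` and the array shapes are illustrative assumptions, not taken from the SEA code.

```python
import numpy as np

def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
    """BPR loss for a batch of (user, positive item, negative item) triples.

    All inputs are arrays of shape (batch, dim). Scoring by dot product is
    an assumption made for this sketch.
    """
    pos_scores = np.sum(user_emb * pos_item_emb, axis=1)
    neg_scores = np.sum(user_emb * neg_item_emb, axis=1)
    diff = pos_scores - neg_scores
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); logaddexp is numerically stable.
    return float(np.mean(np.logaddexp(0.0, -diff)))
```

The loss shrinks as interacted items are scored further above non-interacted ones, which is the ranking behavior the BPR objective rewards.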

Abstract

Multimedia recommendation incorporates various modalities (e.g., images and texts) into user or item representations to improve recommendation quality. Self-supervised learning has carried multimedia recommendation to a performance plateau, owing to its strength in aligning different modalities. However, a growing body of research finds that aligning all modal representations is suboptimal because it damages the unique attributes of each modality. These studies use subtraction and orthogonal constraints in geometric space to learn the unique parts. Our rigorous analysis reveals flaws in this approach: subtraction does not necessarily yield the desired modal-unique features, and orthogonal constraints are ineffective in the high-dimensional representation spaces of users and items. To address these weaknesses, we propose Separate Learning (SEA) for multimedia recommendation, which takes a mutual information view of modal-unique and modal-generic learning. Specifically, we first use a GNN to learn the representations of users and items in different modalities and split each modal representation into generic and unique parts. We employ the contrastive log-ratio upper bound to minimize the mutual information between the generic and unique parts within the same modality, pushing their representations apart and thus learning modal-unique features. Then, we design Solosimloss to maximize a lower bound of mutual information, aligning the generic parts of different modalities and thus learning higher-quality modal-generic features. Finally, extensive experiments on three datasets demonstrate the effectiveness and generalization of our proposed framework. The code is available at SEA, along with the full training record of the main experiment.
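The two mutual information objectives can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the split is assumed to halve the feature dimension, the contrastive log-ratio upper bound uses a unit-variance Gaussian variational model whose mean is a linear map `W` (hypothetical), and a standard InfoNCE bound stands in for Solosimloss, whose exact form is not given here. The weights `alpha`, `beta` and temperature `tau` mirror the alignment weight, distancing weight and temperature coefficient named in Figure 4.

```python
import numpy as np

def split_generic_unique(h):
    """Split (n, d) representations into generic and unique halves (assumed scheme)."""
    half = h.shape[1] // 2
    return h[:, :half], h[:, half:]

def club_upper_bound(x, y, W):
    """Sampled contrastive log-ratio upper bound estimate of I(x; y).

    Assumes q(y | x) is Gaussian with mean x @ W and identity covariance,
    so log q(y | x) = -0.5 * ||y - x @ W||^2 up to a constant that cancels.
    """
    mu = x @ W                                            # (n, d_y)
    pos = -0.5 * np.sum((y - mu) ** 2, axis=1)            # log q(y_i | x_i)
    diff = y[None, :, :] - mu[:, None, :]                 # pairs (i, j)
    neg = -0.5 * np.sum(diff ** 2, axis=2).mean(axis=1)   # mean_j log q(y_j | x_i)
    return float(np.mean(pos - neg))

def info_nce_lower_bound(a, b, tau=0.2):
    """InfoNCE lower bound on I(a; b); row i of a and row i of b form a positive pair."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / tau
    m = logits.max(axis=1, keepdims=True)
    lse = m + np.log(np.sum(np.exp(logits - m), axis=1, keepdims=True))
    log_prob = logits - lse                               # row-wise log-softmax
    return float(np.mean(np.diag(log_prob)) + np.log(a.shape[0]))

def separate_learning_loss(h_img, h_txt, W_img, W_txt, alpha=1.0, beta=1.0, tau=0.2):
    """Auxiliary loss: align generic parts across modalities, distance generic from unique."""
    g_img, u_img = split_generic_unique(h_img)
    g_txt, u_txt = split_generic_unique(h_txt)
    align = -info_nce_lower_bound(g_img, g_txt, tau)      # maximize the lower bound
    distance = club_upper_bound(g_img, u_img, W_img) + club_upper_bound(g_txt, u_txt, W_txt)
    return alpha * align + beta * distance                # minimize the upper bound
```

In practice the variational model inside the upper bound would be trained alongside the encoder; here `W` is passed in only to keep the sketch self-contained.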

Paper Structure (26 sections, 3 theorems, 31 equations, 6 figures, 5 tables)

Introduction
Motivation: Is the Current Paradigm Ideal?
Task Description
The Proposed SEA Framework
GNN-based Multimodal Representation
Heterogeneous Multimodal User-Item Graph
Homogeneous Multimodal Item-Item Graph
Mutual Information Perspective of Modal-Unique and -Generic Learning
Splitting Modal Representation
Minimizing the Upper Bound for Modal-Unique Learning
Maximizing the Lower Bound for Modal-Generic Learning
Fusion and Optimization
Experiments
Experimental Settings
Performance Comparison (RQ1)
...and 11 more sections

Key Result

Theorem 1

Let $x$ and $y$ be two random vectors in $n$-dimensional space, and let $\theta$ be the angle between them. In a high-dimensional space, $x$ and $y$ are, in general, almost orthogonal.
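A quick numerical check illustrates the statement. The sketch below assumes the vectors have i.i.d. standard normal entries (the theorem does not fix a distribution here, so this choice is an assumption), in which case $\cos\theta$ has mean 0 and variance $1/n$. The helper name `mean_abs_cosine` is illustrative.

```python
import numpy as np

def mean_abs_cosine(n_dim, n_pairs=2000, seed=0):
    """Average |cos(theta)| over random pairs of Gaussian vectors in n_dim dimensions."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_pairs, n_dim))
    y = rng.standard_normal((n_pairs, n_dim))
    cos = np.sum(x * y, axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    )
    return float(np.mean(np.abs(cos)))

if __name__ == "__main__":
    # The average |cos(theta)| shrinks toward 0 as the dimension grows.
    for d in (2, 16, 128, 1024):
        print(d, mean_abs_cosine(d))
```

Because random pairs are already nearly orthogonal at the dimensions typical of user and item embeddings, an orthogonality constraint provides little additional signal there, which is the weakness the abstract attributes to prior modal-unique methods.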

Figures (6)

Figure 1: Illustration of the difference between the three strategies: (a) the full modality alignment strategy of most SSL-based methods; (b) learning modal-unique features with an orthogonal constraint; (c) our separate learning strategy.
Figure 2: Overview of our proposed framework.
Figure 3: The effect of each module on SEA.
Figure 4: Impact of the alignment weight $\alpha$, distancing weight $\beta$ and temperature coefficient $\tau$.
Figure 5: Distribution of representations obtained by MICRO, with the modal-generic part on the left and the modal-unique part on the right.
...and 1 more figure

Theorems & Definitions (6)

Theorem 1
Remark 1
Corollary 1
Remark 2
Theorem 2
Remark 3
