It is Never Too Late to Mend: Separate Learning for Multimedia Recommendation
Zhuangzhuang He, Zihan Wang, Yonghui Yang, Haoyue Bai, Le Wu
TL;DR
This work addresses the performance plateau that multimedia recommendation reaches when all modalities are fully aligned via self-supervised learning. It introduces Separate Learning (SEA), an information-theoretic framework that decomposes each modality's representation into modal-unique and modal-generic parts and optimizes them with two mutual information (MI) objectives: minimizing an upper bound on the MI between the generic and unique parts of the same modality to learn modal-unique features, and maximizing a lower bound on the MI between the generic parts of different modalities to strengthen modal-generic features. SEA uses GNNs over heterogeneous user-item graphs and homogeneous item-item graphs to learn modality-aware representations, and trains the fused representations with a BPR objective. Empirical results on three datasets show that SEA consistently outperforms strong baselines, with ablation and sensitivity analyses supporting the necessity and complementarity of its components. The approach offers a flexible, generalizable framework for disentangling modality-specific and modality-agnostic information in multimodal recommendation, improving personalization while preserving modality-specific attributes.
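As a rough illustration of the BPR objective mentioned above, the following is a minimal sketch of the standard pairwise BPR loss for a single (user, positive item, negative item) triple. The function names and the use of a plain dot product as the scoring function are assumptions made for illustration; the summary does not specify how SEA scores or fuses representations.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors (assumed scoring function)."""
    return sum(x * y for x, y in zip(a, b))


def bpr_loss(user, pos_item, neg_item):
    """Standard BPR loss for one triple: -log(sigmoid(s_pos - s_neg)).

    `user`, `pos_item`, and `neg_item` are hypothetical fused embeddings.
    """
    diff = dot(user, pos_item) - dot(user, neg_item)
    # softplus(-diff) equals -log(sigmoid(diff)); this form avoids overflow.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    user = [1.0, 0.0]
    liked = [1.0, 0.0]      # item the user interacted with
    not_seen = [0.0, 1.0]   # sampled negative item
    print(bpr_loss(user, liked, not_seen))
```

The loss shrinks as the positive item is scored further above the negative one, which is the ranking behavior BPR is designed to encourage.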
Abstract
Multimedia recommendation incorporates various modalities (e.g., images and texts) into user or item representations to improve recommendation quality, and self-supervised learning has carried it to a performance plateau thanks to its strength in aligning different modalities. However, a growing body of research finds that aligning all modal representations is suboptimal because it damages the unique attributes of each modality. These studies use subtraction and orthogonal constraints in geometric space to learn the unique parts. Our analysis reveals flaws in this approach: subtraction does not necessarily yield the desired modal-unique part, and orthogonal constraints are ineffective in the high-dimensional representation spaces of users and items. To address these weaknesses, we propose Separate Learning (SEA) for multimedia recommendation, which takes a mutual information view of modal-unique and modal-generic learning. Specifically, we first use a GNN to learn the representations of users and items in different modalities and split each modal representation into generic and unique parts. We then employ the contrastive log-ratio upper bound (CLUB) to minimize the mutual information between the generic and unique parts within the same modality, pushing their representations apart and thereby learning modal-unique features. Next, we design Solosimloss to maximize a lower bound of mutual information that aligns the generic parts of different modalities, thereby learning higher-quality modal-generic features. Finally, extensive experiments on three datasets demonstrate the effectiveness and generalization of the proposed framework. The code and the full training record of the main experiment are available at SEA.
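To make the two mutual information objectives more concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation. The lower-bound term uses the well-known InfoNCE loss as a stand-in, since the abstract does not give the form of Solosimloss. The upper-bound term follows the sampled CLUB form but assumes a fixed unit-variance Gaussian for q(unique | generic), whereas CLUB normally learns this conditional with a network; the estimate is a true upper bound only when q matches the real conditional. All function names and the temperature value are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def infonce_loss(generic_a, generic_b, temperature=0.2):
    """InfoNCE loss between generic parts of two modalities.

    Row i of each batch belongs to the same item. Minimizing this loss
    maximizes the lower bound log(N) - loss on the mutual information.
    """
    n = len(generic_a)
    total = 0.0
    for i in range(n):
        logits = [dot(generic_a[i], generic_b[j]) / temperature for j in range(n)]
        total += logsumexp(logits) - logits[i]
    return total / n


def club_estimate(generic, unique):
    """Sampled CLUB-style estimate between generic and unique parts.

    Assumes log q(u | g) = -0.5 * ||u - g||^2 up to a constant, which
    cancels between the paired and unpaired terms.
    """
    n = len(generic)
    paired = sum(-0.5 * sq_dist(unique[i], generic[i]) for i in range(n)) / n
    unpaired = sum(
        -0.5 * sq_dist(unique[j], generic[i]) for i in range(n) for j in range(n)
    ) / (n * n)
    return paired - unpaired


if __name__ == "__main__":
    image_generic = [[1.0, 0.0], [0.0, 1.0]]
    text_generic = [[1.0, 0.0], [0.0, 1.0]]
    print(infonce_loss(image_generic, text_generic))
    print(club_estimate([[0.0, 0.0], [4.0, 4.0]], [[0.0, 0.0], [4.0, 4.0]]))
```

In a training loop, one would minimize the CLUB-style term within each modality and the InfoNCE term across modalities, alongside the recommendation loss; the relative weighting of these terms is left open here because the abstract does not state it.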
