GUME: Graphs and User Modalities Enhancement for Long-Tail Multimodal Recommendation
Guojiao Lin, Zhen Meng, Dongjie Wang, Qingqing Long, Yuanchun Zhou, Meng Xiao
TL;DR
GUME tackles long-tail multimodal recommendation by enhancing tail-item connectivity through multimodal similarity-driven graph augmentation and by learning richer user modality representations via explicit interaction and extended interest embeddings. It introduces modality item graphs, semantic-neighbor augmentation, attribute separation into coarse and fine granularity, and dual alignment objectives that denoise signals from internal and external perspectives, all trained with a BPR-based objective. The approach yields strong performance gains across four Amazon domains, with notable improvements on tail items, supported by ablations, visualizations, and hyperparameter analyses. Overall, GUME provides a scalable, generalizable framework that leverages multimodal item similarities and contrastive learning to improve long-tail multimodal recommendation on real-world datasets.
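To make the idea of semantic-neighbor augmentation concrete, the following is a minimal sketch, not the authors' implementation. It assumes each item has a single modality feature vector, picks each item's top-k neighbors by cosine similarity, and adds those neighbors as extra edges for every user who interacted with the item. The function names (`top_k_neighbors`, `augment_interactions`), the toy features, and the rule for adding edges are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k_neighbors(features, k):
    """For each item index, return the indices of its k most similar other items."""
    neighbors = {}
    for i, fi in enumerate(features):
        scored = [(cosine(fi, fj), j) for j, fj in enumerate(features) if j != i]
        # Sort by descending similarity; break ties by item index for determinism.
        scored.sort(key=lambda t: (-t[0], t[1]))
        neighbors[i] = [j for _, j in scored[:k]]
    return neighbors

def augment_interactions(user_items, neighbors):
    """Hypothetical rule: link each user to the semantic neighbors of their items."""
    augmented = {}
    for user, items in user_items.items():
        extra = {n for i in items for n in neighbors[i]}
        augmented[user] = set(items) | extra
    return augmented

# Toy modality features: items 0 and 1 are similar, items 2 and 3 are similar.
item_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
user_items = {"u1": {0}, "u2": {2}}

neighbors = top_k_neighbors(item_features, k=1)
augmented = augment_interactions(user_items, neighbors)
```

In this toy setup, items 1 and 3 have no interactions at all, yet after augmentation each is reachable from a user through its similar, interacted-with neighbor. That is the connectivity effect the summary describes for tail items.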
Abstract
Multimodal recommendation systems (MMRS) have received considerable attention from the research community because they jointly exploit user behavior data together with product images and text. Previous research has two main limitations. First, many long-tail items in recommendation systems have limited interaction data, which makes it difficult to learn comprehensive and informative representations for them, yet past MMRS studies have overlooked this issue. Second, users' modality preferences are crucial to their behavior, but previous research has focused primarily on learning item modality representations, leaving user modality representations relatively simplistic. To address these challenges, we propose Graphs and User Modalities Enhancement (GUME), a novel framework for long-tail multimodal recommendation. Specifically, we first enhance the user-item graph using multimodal similarity between items. This improves the connectivity of long-tail items and helps them learn high-quality representations through graph propagation. Then, we construct two types of user modality features: explicit interaction features and extended interest features. By applying a user modality enhancement strategy that maximizes the mutual information between these two features, we improve the generalization ability of user modality representations. Additionally, we design an alignment strategy for modality data that removes noise from both internal and external perspectives. Extensive experiments on four publicly available datasets demonstrate the effectiveness of our approach.
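Maximizing mutual information between two views of the same user is commonly approximated with a contrastive InfoNCE loss. The sketch below illustrates that general technique under stated assumptions; the abstract does not specify the exact estimator, so the choice of InfoNCE, the function name `info_nce`, the temperature value, and the toy vectors are all illustrative rather than taken from the paper.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v] if n else list(v)

def info_nce(explicit, extended, temperature=0.2):
    """Mean InfoNCE loss where row i of `extended` is the positive for row i of `explicit`.

    All other rows of `extended` act as in-batch negatives.
    """
    a = [normalize(v) for v in explicit]
    b = [normalize(v) for v in extended]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, bj) / temperature for bj in b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(a)

# Toy user features: two users, two dimensions (assumed values).
explicit_features = [[1.0, 0.0], [0.0, 1.0]]
aligned_extended = [[0.9, 0.1], [0.1, 0.9]]   # each user's views agree
shuffled_extended = [[0.1, 0.9], [0.9, 0.1]]  # each user's views disagree

aligned_loss = info_nce(explicit_features, aligned_extended)
shuffled_loss = info_nce(explicit_features, shuffled_extended)
```

The loss is lower when each user's explicit interaction feature is closest to that same user's extended interest feature, so minimizing it pulls the two views of a user together while pushing apart views belonging to different users.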
