MDVT: Enhancing Multimodal Recommendation with Model-Agnostic Multimodal-Driven Virtual Triplets
Jinfeng Xu, Zheyu Chen, Jinze Li, Shuo Yang, Hewei Wang, Yijie Li, Mengran Li, Puzhen Wu, Edith C. H. Ngai
TL;DR
MDVT tackles data sparsity in multimodal recommender systems by introducing multimodal-driven virtual triplets that provide informative supervision signals beyond user-item interactions. The framework is model-agnostic and comprises a virtual triplet constructor, three warm-up threshold strategies (static, dynamic, hybrid), and an enhanced pairwise loss that blends standard BPR with virtual-triplet supervision. Empirical results across multiple real-world datasets and state-of-the-art models show consistent performance gains, with the hybrid strategy offering a robust and efficient trade-off between search cost and accuracy. The approach accelerates convergence, improves performance in sparse regimes, and remains compatible with robustness and augmentation techniques such as AMR and GPT-4o, suggesting practical impact for scalable, high-quality multimodal recommendations.
Abstract
The data sparsity problem significantly hinders the performance of recommender systems, as traditional models rely on limited historical interactions to learn user preferences and item properties. While incorporating multimodal information can explicitly represent these preferences and properties, existing works often use it only as side information, failing to fully leverage its potential. In this paper, we propose MDVT, a model-agnostic approach that constructs multimodal-driven virtual triplets to provide valuable supervision signals, effectively mitigating the data sparsity problem in multimodal recommender systems. To ensure high-quality virtual triplets, we introduce three tailored warm-up threshold strategies: static, dynamic, and hybrid. The static warm-up threshold strategy exhaustively searches for the optimal number of warm-up epochs but is time-consuming and computationally intensive. The dynamic warm-up threshold strategy adjusts the warm-up period based on loss trends, improving efficiency but potentially missing optimal performance. The hybrid warm-up threshold strategy combines both, using the dynamic strategy to find the approximate optimal number of warm-up epochs and then refining it with the static strategy in a narrow hyper-parameter space. Once the warm-up threshold is satisfied, the virtual triplets are used for joint model optimization through our enhanced pairwise loss function without causing significant gradient skew. Extensive experiments on multiple real-world datasets demonstrate that integrating MDVT into advanced multimodal recommendation models effectively alleviates the data sparsity problem and improves recommendation performance, particularly in sparse data scenarios.
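As a rough illustration of the training flow described in the abstract, the following minimal Python sketch shows one way a loss-trend warm-up check and a blended pairwise loss could fit together. It is not the authors' implementation: the function names (`bpr_loss`, `warmup_done`, `enhanced_loss`), the stopping rule (relative loss improvement over a window), and the parameters `window`, `tol`, and `weight` are all assumptions made for illustration, since the abstract does not specify them. Triplets are represented here simply as (positive score, negative score) pairs.

```python
import math


def bpr_loss(pos_score, neg_score):
    """Standard BPR term: -log(sigmoid(pos_score - neg_score))."""
    diff = pos_score - neg_score
    # Numerically stable softplus(-diff).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def warmup_done(loss_history, window=3, tol=1e-2):
    """Hypothetical dynamic rule (an assumption, not from the paper):
    warm-up ends once the relative loss improvement over the last
    `window` epochs drops below `tol`."""
    if len(loss_history) <= window:
        return False
    old, new = loss_history[-window - 1], loss_history[-1]
    return (old - new) / max(abs(old), 1e-12) < tol


def enhanced_loss(real_pairs, virtual_pairs, use_virtual, weight=0.1):
    """Mean BPR loss on observed triplets, plus a down-weighted mean BPR
    loss on virtual triplets once warm-up has finished. The additive
    form and `weight` are illustrative assumptions."""
    real = sum(bpr_loss(p, n) for p, n in real_pairs) / len(real_pairs)
    if not use_virtual or not virtual_pairs:
        return real
    virtual = sum(bpr_loss(p, n) for p, n in virtual_pairs) / len(virtual_pairs)
    return real + weight * virtual


# Usage: virtual triplets only contribute after the loss has plateaued.
history = [1.0, 0.5, 0.5, 0.5, 0.5]
loss = enhanced_loss([(1.0, 0.0)], [(0.0, 0.0)], warmup_done(history))
```

Under these assumptions, a hybrid strategy would take the epoch at which `warmup_done` first returns true as a starting point and then search a small range of fixed warm-up epochs around it. Keeping `weight` small is one simple way to limit how much the virtual triplets shift the gradient relative to the observed interactions.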
