BiVRec: Bidirectional View-based Multimodal Sequential Recommendation
Jiaxi Hu, Jingtong Gao, Xiangyu Zhao, Yuehong Hu, Yuxuan Liang, Yiqi Wang, Ming He, Zitao Liu, Hongzhi Yin
TL;DR
This work tackles the limitations of both the ID-dominant and the pure multimodal paradigms in sequential recommendation by introducing BiVRec, a bidirectional framework that jointly trains recommendation in an ID view and a multimodal view. It relies on three modules: Multi-scale Interest Embedding, Intra-View Interest Decomposition (Gaussian and Cluster attention), and Cross-View Interest Learning (coarse- and fine-grained signals). Together they build structured, multi-scale user interests and learn the synergistic relationship between the two views. The model optimizes a multi-task objective that combines the two view-specific recommendation losses with a cross-view semantic alignment loss and an interest allocation loss. It achieves state-of-the-art results on five datasets and demonstrates robustness to noise, cold-start scenarios, and cross-dataset transfer. Overall, BiVRec offers a cost-efficient, flexible approach that exploits both the cross-dataset transferability of multimodal features and the ID information that pure multimodal models overlook, improving recommendation in both views.
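The multi-task objective summarized above can be sketched as follows. This is a minimal, hypothetical illustration in plain Python, not the authors' implementation: the function name `bivrec_style_loss`, the use of cross-entropy for the per-view next-item losses, cosine distance for the coarse-grained alignment, symmetric KL divergence for the interest-allocation term, and the weights `alpha` and `beta` are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    # Numerically stable softmax over candidate-item scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Vector, target: int) -> float:
    # Assumed per-view next-item loss: -log p(target).
    return -math.log(softmax(logits)[target])


def cosine(u: Vector, v: Vector, eps: float = 1e-12) -> float:
    # Cosine similarity; eps guards against zero-norm vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + eps)


def kl(p: Vector, q: Vector, eps: float = 1e-12) -> float:
    # KL(p || q) between two discrete distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bivrec_style_loss(id_logits: Vector, mm_logits: Vector, target: int,
                      id_user: Vector, mm_user: Vector,
                      id_alloc: Vector, mm_alloc: Vector,
                      alpha: float = 0.1, beta: float = 0.1) -> float:
    """Hypothetical multi-task loss: two view-specific recommendation
    losses plus coarse- and fine-grained cross-view terms."""
    l_id = cross_entropy(id_logits, target)   # ID-view recommendation
    l_mm = cross_entropy(mm_logits, target)   # multimodal-view recommendation
    # Coarse-grained (assumed form): pull overall user representations together.
    l_align = 1.0 - cosine(id_user, mm_user)
    # Fine-grained (assumed form): symmetric KL between the two views'
    # distributions of items over interests.
    l_alloc = 0.5 * (kl(id_alloc, mm_alloc) + kl(mm_alloc, id_alloc))
    return l_id + l_mm + alpha * l_align + beta * l_alloc
```

When both views produce identical user representations and identical interest allocations, the two cross-view terms vanish (up to floating-point rounding) and the loss reduces to the sum of the two recommendation losses.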
Abstract
The integration of multimodal information into sequential recommender systems has attracted significant attention in recent research. In early multimodal sequential recommendation models, the mainstream paradigm was ID-dominant recommendation, in which multimodal information was fused as side information. However, because of this paradigm's limitations in transferability and information intrusion, another paradigm emerged in which multimodal features are used directly for recommendation, enabling recommendation across datasets. Nonetheless, this paradigm overlooks user ID information, resulting in low information utilization and high training costs. To this end, we propose an innovative framework, BiVRec, that jointly trains the recommendation tasks in both the ID and multimodal views, leveraging their synergistic relationship to enhance recommendation performance bidirectionally. To tackle the information heterogeneity between the views, we first construct structured user interest representations and then learn the synergistic relationship between them. Specifically, BiVRec comprises three modules: Multi-scale Interest Embedding, which comprehensively models user interests by expanding user interaction sequences with multi-scale patching; Intra-View Interest Decomposition, which constructs highly structured interest representations using carefully designed Gaussian attention and Cluster attention; and Cross-View Interest Learning, which learns the synergistic relationship between the two recommendation views through coarse-grained overall semantic similarity and fine-grained interest allocation similarity. BiVRec achieves state-of-the-art performance on five datasets and showcases various practical advantages.
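Multi-scale patching, as described in the abstract, expands an interaction sequence with coarser summaries of itself. The sketch below assumes one plausible reading: at each scale s, consecutive item embeddings are mean-pooled in non-overlapping windows of length s, and the pooled tokens from all scales are concatenated. The scales (1, 2, 4), the mean pooling, and the function names `mean_pool` and `multi_scale_patch` are hypothetical choices, not details taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    # Element-wise mean of a non-empty list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def multi_scale_patch(seq: List[Vector],
                      scales: Sequence[int] = (1, 2, 4)) -> List[Vector]:
    """Hypothetical multi-scale patching of item embeddings.

    For each scale s, split the sequence into non-overlapping windows of
    s consecutive items (the last window may be shorter) and mean-pool
    each window into one token. Tokens from all scales are concatenated,
    finest scale first.
    """
    expanded: List[Vector] = []
    for s in scales:
        for start in range(0, len(seq), s):
            expanded.append(mean_pool(seq[start:start + s]))
    return expanded
```

For a sequence of four one-dimensional embeddings [1], [3], [5], [7], the scales (1, 2, 4) yield seven tokens: the four originals, the two pair averages [2] and [6], and the overall average [4]. Downstream attention can then attend to single items and to short-range item groups at the same time.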
