M^2VAE: Multi-Modal Multi-View Variational Autoencoder for Cold-start Item Recommendation
Chuan He, Yongchao Liu, Qiang Li, Wenliang Zhong, Chuntao Hong, Xinwei Yao
TL;DR
The paper tackles cold-start item recommendation by leveraging multi-modal content and explicitly modeling the multi-view structure of item features. It introduces the Multi-Modal Multi-View Variational AutoEncoder (M^2VAE), which learns type-specific latent variables for item IDs, categorical attributes, and images, derives a common view via Product-of-Experts (PoE), and disentangles the common view from the unique views with a dedicated contrastive loss. A user-aware Mixture-of-Experts (MoE) module then adaptively fuses the common and unique views into item representations. Co-occurrence signals are incorporated through contrastive learning, and the model is trained end to end with a CVAE objective and BPR optimization. Empirical results on three real-world datasets show that M^2VAE outperforms state-of-the-art baselines; ablation studies confirm the contribution of each component, and a case study illustrates the interpretability of personalized view preferences. The approach offers a scalable, end-to-end solution for cold-start recommendation in multi-modal settings, without requiring pretraining.
Abstract
Cold-start item recommendation is a significant challenge in recommendation systems, particularly when new items are introduced without any historical interaction data. While existing methods leverage multi-modal content to alleviate the cold-start issue, they often neglect the inherent multi-view structure of modalities, that is, the distinction between features shared across modalities and features specific to a single modality. In this paper, we propose the Multi-Modal Multi-View Variational AutoEncoder (M^2VAE), a generative model that addresses two challenges: modeling the common and unique views in attribute and multi-modal features, and modeling user preferences over individual types of item features. Specifically, we generate type-specific latent variables for item IDs, categorical attributes, and image features, and use Product-of-Experts (PoE) to derive a common representation. A disentangled contrastive loss decouples the common view from the unique views while preserving feature informativeness. To model user inclinations, we employ a preference-guided Mixture-of-Experts (MoE) to adaptively fuse the representations. We further incorporate co-occurrence signals via contrastive learning, eliminating the need for pretraining. Extensive experiments on real-world datasets validate the effectiveness of our approach.
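To make the PoE step concrete, the sketch below fuses several diagonal-Gaussian experts (for example, the ID, attribute, and image latents) into a single Gaussian using the standard closed form: precisions add, and the joint mean is the precision-weighted average of the expert means. The abstract does not give the exact formulation used in M^2VAE, so this is only an illustration under stated assumptions. The function name `product_of_experts`, the `include_prior` flag, and the optional standard-normal prior expert are hypothetical choices, not taken from the paper.

```python
import numpy as np


def product_of_experts(mus, logvars, include_prior=False):
    """Fuse diagonal-Gaussian experts into one diagonal Gaussian.

    mus, logvars: array-likes of shape (num_experts, latent_dim).
    include_prior: hypothetical option that adds a standard-normal
        expert N(0, I); whether M^2VAE does this is an assumption.
    Returns (joint_mu, joint_logvar), each of shape (latent_dim,).
    """
    mus = np.asarray(mus, dtype=float)
    logvars = np.asarray(logvars, dtype=float)

    if include_prior:
        # A standard-normal expert has mean 0 and log-variance 0.
        prior = np.zeros((1, mus.shape[1]))
        mus = np.vstack([prior, mus])
        logvars = np.vstack([prior, logvars])

    # Precision is the inverse variance of each expert.
    precisions = np.exp(-logvars)

    # Product of Gaussians: precisions add up across experts.
    joint_var = 1.0 / precisions.sum(axis=0)

    # Joint mean is the precision-weighted average of expert means.
    joint_mu = joint_var * (mus * precisions).sum(axis=0)

    return joint_mu, np.log(joint_var)


if __name__ == "__main__":
    # Three toy experts standing in for ID, attribute, and image latents.
    mus = [[0.0, 1.0], [2.0, 1.0], [1.0, 4.0]]
    logvars = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    mu, logvar = product_of_experts(mus, logvars)
    print("common-view mean:", mu)
    print("common-view variance:", np.exp(logvar))
```

A useful property of this form is that a confident expert (small variance) dominates the common view, while an uninformative expert (large variance) contributes almost nothing, which is one reason PoE is a natural fit when some feature types are missing or noisy for new items.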
