CADMR: Cross-Attention and Disentangled Learning for Multimodal Recommender Systems

Yasser Khalafaoui, Martino Lovisetto, Basarab Matei, Nistor Grozavu

TL;DR

This work tackles the challenge of sparse, high-dimensional user-item rating matrices in multimodal recommender systems. It introduces CADMR, which first disentangles modality-specific features into a joint latent representation and then uses a multi-head cross-attention mechanism to align user-item interactions with multimodal item features, followed by autoencoder-based reconstruction. The model optimizes $\mathcal{L} = \mathcal{L}_{MSE} + \lambda \mathcal{L}_{TC}$ with $\lambda=0.5$, and uses $Q$, $K$, $V$ in cross-attention where the rating matrix serves as the query and fused multimodal features as keys/values. Experiments on three Amazon multimodal datasets show CADMR substantially outperforms seven baselines in NDCG@K and Recall@K, with ablation studies confirming the necessity of both cross-attention and disentangled learning. These findings highlight CADMR's robust and interpretable integration of text and image modalities for recommender systems and demonstrate the value of cross-attention in multimodal fusion.
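The cross-attention step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the projection matrices `W_q`, `W_k`, `W_v`, the dimension `d_model`, and all shapes are assumptions made for illustration. The only detail taken from the text is that the rating matrix supplies the queries and the fused multimodal item features supply the keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(R, Z, W_q, W_k, W_v, n_heads):
    """Multi-head cross-attention (illustrative shapes, all assumed).

    R: (n_users, n_items) rating matrix, used as the query source.
    Z: (n_items, d_z) fused multimodal item features, used for keys/values.
    """
    Q = R @ W_q  # (n_users, d_model)
    K = Z @ W_k  # (n_items, d_model)
    V = Z @ W_v  # (n_items, d_model)
    d_head = Q.shape[1] // n_heads

    def split(M):
        # (rows, d_model) -> (n_heads, rows, d_head)
        return M.reshape(M.shape[0], n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Scaled dot-product scores: (n_heads, n_users, n_items)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)  # each user attends over items
    heads = weights @ Vh                # (n_heads, n_users, d_head)
    # Concatenate the heads back to (n_users, d_model).
    out = heads.transpose(1, 0, 2).reshape(R.shape[0], n_heads * d_head)
    return out, weights

# Toy data: hypothetical sizes, random weights instead of trained ones.
rng = np.random.default_rng(0)
n_users, n_items, d_z, d_model, n_heads = 4, 6, 5, 8, 2
R = (rng.random((n_users, n_items)) < 0.3).astype(float)
Z = rng.normal(size=(n_items, d_z))
W_q = rng.normal(size=(n_items, d_model)) * 0.1
W_k = rng.normal(size=(d_z, d_model)) * 0.1
W_v = rng.normal(size=(d_z, d_model)) * 0.1
out, weights = cross_attention(R, Z, W_q, W_k, W_v, n_heads)
```

In this sketch each user's row of `weights` is a distribution over items, so the output mixes item features according to how strongly the user's rating profile matches each item. How CADMR maps this output back to the rating-matrix shape before the autoencoder is not stated here, so that step is left out.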

Abstract

The increasing availability and diversity of multimodal data in recommender systems offer new avenues for enhancing recommendation accuracy and user satisfaction. However, these systems must contend with high-dimensional, sparse user-item rating matrices, where reconstructing the matrix with only small subsets of preferred items for each user poses a significant challenge. To address this, we propose CADMR, a novel autoencoder-based multimodal recommender system framework. CADMR leverages multi-head cross-attention mechanisms and Disentangled Learning to effectively integrate and utilize heterogeneous multimodal data in reconstructing the rating matrix. Our approach first disentangles modality-specific features while preserving their interdependence, thereby learning a joint latent representation. The multi-head cross-attention mechanism is then applied to enhance user-item interaction representations with respect to the learned multimodal item latent representations. We evaluate CADMR on three benchmark datasets, demonstrating significant performance improvements over state-of-the-art methods.
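The evaluation is reported in Recall@K and NDCG@K. As a rough guide to what these metrics measure, the sketch below computes both for a single user under the common binary-relevance definitions. The paper's exact evaluation protocol is not given here, so the function names, the toy scores, and the handling of edge cases are assumptions.

```python
import numpy as np

def recall_at_k(scores, relevant, k):
    # Fraction of the user's relevant items that appear in the top-k ranking.
    top_k = np.argsort(-scores)[:k]
    hits = len(set(top_k.tolist()) & relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(scores, relevant, k):
    # Binary-relevance NDCG: DCG of the top-k ranking over the ideal DCG.
    top_k = np.argsort(-scores)[:k]
    gains = np.array([1.0 if i in relevant else 0.0 for i in top_k])
    discounts = 1.0 / np.log2(np.arange(2, len(top_k) + 2))
    dcg = float((gains * discounts).sum())
    ideal_hits = min(len(relevant), k)
    idcg = float(discounts[:ideal_hits].sum())
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical predicted scores for five items and a held-out relevant set.
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])
relevant = {0, 3}
recall = recall_at_k(scores, relevant, k=3)  # top-3 is items 0, 2, 4 -> 0.5
ndcg = ndcg_at_k(scores, relevant, k=3)
```

Recall@K ignores where in the top K a hit appears, while NDCG@K rewards hits placed near the top of the list, which is why the two are usually reported together.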

Paper Structure

This paper contains 29 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: CADMR architecture for a multimodal recommender system. In the pretraining phase, the autoencoder and modality-specific feature extractors are trained separately. In the fine-tuning phase, a cross-attention mechanism integrates the user-item rating matrix with the unified multimodal representation, refining the matrix, which is then processed through the trained autoencoder to produce the final reconstructed rating matrix.
  • Figure 2: CADMR performance with respect to the training set size.
  • Figure 3: Impact of the number of cross-attention heads on the overall performance of our model.