Are Multimodal Embeddings Truly Beneficial for Recommendation? A Deep Dive into Whole vs. Individual Modalities
Yu Ye, Junchen Fu, Yu Song, Kaiwen Zheng, Joemon M. Jose
TL;DR
This paper questions the assumed benefit of multimodal embeddings in recommendation through a large-scale, controlled modality knockout study across 14 state-of-the-art models on three Amazon datasets. It systematically replaces visual and/or textual embeddings with constants or random noise during training and inference to isolate each modality's contribution, and evaluates the results with Recall and NDCG. The key findings are that multimodal embeddings improve performance mainly when paired with sophisticated graph-based fusion, that text alone often matches full multimodal performance, and that images provide limited gains. This points to a text-dominant dynamic and to potential artifacts in simple fusion baselines. The work offers practical guidance for model design and evaluation in multimodal recommendation, and the authors commit to releasing code and datasets for reproducibility and further research.
Abstract
Multimodal recommendation has emerged as a mainstream paradigm, typically leveraging text and visual embeddings extracted from pre-trained models such as Sentence-BERT, Vision Transformers, and ResNet. This approach rests on the intuitive assumption that incorporating multimodal embeddings enhances recommendation performance. Despite its popularity, this assumption lacks comprehensive empirical verification, which leaves a critical research gap. To address it, we pose the central research question of this paper: Are multimodal embeddings truly beneficial for recommendation? To answer it, we conduct a large-scale empirical study examining the role of text and visual embeddings in modern multimodal recommendation models, both as a whole and individually. Specifically, we ask two sub-questions: (1) Do multimodal embeddings as a whole improve recommendation performance? (2) Is each individual modality (text and image) useful when used alone? To isolate the effect of an individual modality, we employ a modality knockout strategy that sets the corresponding embeddings to either constant values or random noise. To ensure the scale and comprehensiveness of our study, we evaluate 14 widely used state-of-the-art multimodal recommendation models. Our findings reveal that: (1) Multimodal embeddings generally enhance recommendation performance, particularly when integrated through more sophisticated graph-based fusion models. Surprisingly, commonly adopted baseline models with simple fusion schemes, such as VBPR and BM3, show only limited gains. (2) The text modality alone achieves performance comparable to the full multimodal setting in most cases, whereas the image modality alone does not. These results offer foundational insights and practical guidance for the multimodal recommendation community.
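
To make the modality knockout idea concrete, the following is a minimal sketch of how one modality's item embeddings could be replaced by a constant or by random noise before being passed to a recommender. It is an illustration only and not the authors' implementation: the function names (`knock_out`, `build_item_features`), the mode labels, the choice of 0.0 as the constant, and the standard Gaussian as the noise distribution are all assumptions made here.

```python
import random
from typing import Dict, List

Matrix = List[List[float]]  # one row of floats per item


def knock_out(emb: Matrix, mode: str, constant: float = 0.0, seed: int = 0) -> Matrix:
    """Return a new embedding table with the same shape as `emb`.

    mode="keep"     -> copy of the original pre-trained embeddings
    mode="constant" -> every entry set to `constant` (0.0 is an assumed choice)
    mode="noise"    -> every entry drawn from N(0, 1) (an assumed distribution)
    """
    rng = random.Random(seed)  # seeded so a knockout run is repeatable
    if mode == "keep":
        return [row[:] for row in emb]
    if mode == "constant":
        return [[constant] * len(row) for row in emb]
    if mode == "noise":
        return [[rng.gauss(0.0, 1.0) for _ in row] for row in emb]
    raise ValueError(f"unknown mode: {mode!r}")


def build_item_features(text: Matrix, image: Matrix,
                        text_mode: str, image_mode: str) -> Dict[str, Matrix]:
    """Apply an independent knockout mode to each modality (hypothetical helper)."""
    return {
        "text": knock_out(text, text_mode, seed=1),
        "image": knock_out(image, image_mode, seed=2),
    }


# Toy pre-trained embeddings for two items (values are made up).
text_emb = [[0.2, -0.1, 0.7], [0.5, 0.3, -0.4]]
image_emb = [[1.0, 0.0], [0.0, 1.0]]

# "Text alone": keep the text embeddings and knock out the image embeddings.
text_only = build_item_features(text_emb, image_emb, "keep", "constant")
# Both modalities knocked out with noise, as a no-content reference point.
no_content = build_item_features(text_emb, image_emb, "noise", "noise")
```

Under these assumptions, the same knocked-out tables would be used for both training and evaluation, so any remaining performance can only come from the modality that was kept and from the interaction data.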
