MM-GEF: Multi-modal representation meet collaborative filtering
Hao Wu, Alejandro Ariza-Casabona, Bartłomiej Twardowski, Tri Kurniawan Wijaya
TL;DR
MM-GEF addresses the challenge of integrating rich multi-modal item content with collaborative signals in recommender systems by constructing an early-fused item graph that encodes cross-modal relations and high-order user-item interactions. It fuses CLIP-derived visual and textual features at an early stage, builds modality- and collaboration-based item graphs, and propagates information via a graph interaction network before aligning with a CF backbone. Across four public datasets, MM-GEF consistently outperforms state-of-the-art multi-modal methods, with notable gains in NDCG and robustness in cold-start settings, demonstrating the value of joint, graph-based fusion. The work highlights practical implications for deploying richer item representations in real-world recommender systems, enabling better accuracy and resilience with limited interaction data.
Abstract
In modern e-commerce, item content features in various modalities offer accurate and comprehensive information to recommender systems. Most previous work focuses either on learning effective item representations while modelling user-item interactions, or on exploring item-item relationships by analysing multi-modal features. Those methods, however, fail to incorporate the collaborative item-user-item relationships into the multi-modal feature-based item structure. In this work, we propose MM-GEF (Multi-Modal recommendation with Graph Early-Fusion), a graph-based item structure enhancement method that effectively combines the latent item structure underlying multi-modal content with collaborative signals. Instead of processing the content features of different modalities separately, we show that early fusion of multi-modal features provides significant improvement. MM-GEF learns refined item representations by injecting structural information obtained from both multi-modal and collaborative signals. Through extensive experiments on four publicly available datasets, we demonstrate systematic improvements of our method over state-of-the-art multi-modal recommendation methods.
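To make the idea of combining a content-based item structure with collaborative item-user-item relations more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the fusion operator (concatenating normalized vectors), the k-nearest-neighbour content graph, the co-interaction counts, the mixing weight `alpha`, and every function name are assumptions chosen for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def early_fuse(visual, textual):
    """Fuse modalities before any graph is built (assumed operator:
    concatenate the normalized vectors, then normalize the result)."""
    return [l2_normalize(l2_normalize(v) + l2_normalize(t))
            for v, t in zip(visual, textual)]


def knn_content_graph(features, k):
    """Item-item graph keeping each item's k most cosine-similar items."""
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = [(sum(a * b for a, b in zip(features[i], features[j])), j)
                for j in range(n) if j != i]
        sims.sort(reverse=True)
        for sim, j in sims[:k]:
            adj[i][j] = max(sim, 0.0)  # clip negative similarities
    return adj


def co_interaction_graph(user_items, n_items):
    """Item-user-item graph: edge weight counts users who touched both items."""
    adj = [[0.0] * n_items for _ in range(n_items)]
    for items in user_items:
        unique = sorted(set(items))
        for a in unique:
            for b in unique:
                if a != b:
                    adj[a][b] += 1.0
    return adj


def row_normalize(adj):
    """Make each non-empty row sum to one."""
    return [[x / sum(row) for x in row] if sum(row) > 0 else list(row)
            for row in adj]


def propagate(content_adj, collab_adj, embeddings, alpha=0.5):
    """One propagation step over the mixed graph, with a residual connection.
    alpha is a hypothetical weight balancing content and collaborative edges."""
    c, g = row_normalize(content_adj), row_normalize(collab_adj)
    n, d = len(embeddings), len(embeddings[0])
    out = []
    for i in range(n):
        new = list(embeddings[i])
        for j in range(n):
            w = alpha * c[i][j] + (1.0 - alpha) * g[i][j]
            if w:
                for t in range(d):
                    new[t] += w * embeddings[j][t]
        out.append(new)
    return out
```

A typical call would fuse the two feature lists with `early_fuse`, build both graphs from the fused features and the per-user item lists, and pass them to `propagate`. In this sketch the refined vectors would then be handed to a collaborative filtering model, which is omitted here.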
