Formalizing Multimedia Recommendation through Multimodal Deep Learning
Daniele Malitesta, Giandomenico Cornacchia, Claudio Pomo, Felice Antonio Merra, Tommaso Di Noia, Eugenio Di Sciascio
TL;DR
This paper addresses the lack of a unified formalization for multimedia recommendation by proposing a schema, inspired by multimodal deep learning, that structures the problem into three stages: multimodal input data, feature processing, and multimodal fusion. It formalizes each stage, applies the schema to four representative models, and integrates it into the Elliot benchmarking platform to compare six multimodal recommenders against four classical baselines on both accuracy and beyond-accuracy metrics. The benchmarking shows that some multimodal models achieve strong accuracy (e.g., LATTICE, BM3, FREEDOM), while beyond-accuracy measures often favor approaches such as GRCN that balance novelty, diversity, and popularity bias. The work also discusses technical challenges such as missing modalities and the limitations of pre-trained features, and outlines future directions: domain-specific features and more extensive, fair evaluations with standardized protocols and datasets.
Abstract
Recommender systems (RSs) offer personalized navigation experiences on online platforms, but recommendation remains a challenging task, particularly in specific scenarios and domains. Multimodality can help tap into richer information sources and construct more refined user/item profiles for recommendations. However, existing literature lacks a shared and universal schema for modeling and solving the recommendation problem through the lens of multimodality. This work aims to formalize a general multimodal schema for multimedia recommendation. It provides a comprehensive literature review of multimodal approaches for multimedia recommendation from the last eight years, outlines the theoretical foundations of a multimodal pipeline, and demonstrates its rationale by applying it to selected state-of-the-art approaches. The work also conducts a benchmarking analysis of recent algorithms for multimedia recommendation within Elliot, a rigorous framework for evaluating recommender systems. The main aim is to provide guidelines for designing and implementing the next generation of multimodal approaches in multimedia recommendation.
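To make the three stages of the multimodal pipeline (multimodal input data, feature processing, multimodal fusion) more concrete, the following is a minimal illustrative sketch in plain Python. It is not the paper's formal definition or any benchmarked model: the function names, the fixed projection matrices, the choice of element-wise mean as the fusion operator, and the dot-product scoring are all assumptions made for illustration.

```python
from typing import Dict, List

Vector = List[float]


def process_features(raw: Vector, weights: List[Vector]) -> Vector:
    """Feature processing: project a raw modality vector into a shared
    latent space with a linear map (hypothetical, fixed weights)."""
    return [sum(w * x for w, x in zip(row, raw)) for row in weights]


def fuse(modalities: Dict[str, Vector]) -> Vector:
    """Multimodal fusion: element-wise mean of the processed vectors.
    This is only one possible fusion operator, chosen for simplicity."""
    vectors = list(modalities.values())
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def score(user: Vector, item: Vector) -> float:
    """Predicted preference as a dot product (an assumed scoring function)."""
    return sum(u * i for u, i in zip(user, item))


# Stage 1: multimodal input data (toy features for a single item).
raw_item = {"visual": [1.0, 0.0, 2.0], "textual": [0.5, 1.5]}

# Stage 2: feature processing with hypothetical projection matrices
# that map each modality to a 2-dimensional latent space.
projections = {
    "visual": [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    "textual": [[1.0, 0.0], [0.0, 1.0]],
}
processed = {m: process_features(raw_item[m], projections[m]) for m in raw_item}

# Stage 3: multimodal fusion into a single item representation.
item_embedding = fuse(processed)

# Score the fused item against a toy user embedding.
user_embedding = [1.0, 2.0]
prediction = score(user_embedding, item_embedding)
print(item_embedding, prediction)
```

In a real recommender the projections and user embeddings would be learned, and the fusion step could instead use concatenation, attention, or another operator; the sketch only shows how the three stages compose.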
