Formalizing Multimedia Recommendation through Multimodal Deep Learning

Daniele Malitesta; Giandomenico Cornacchia; Claudio Pomo; Felice Antonio Merra; Tommaso Di Noia; Eugenio Di Sciascio

TL;DR

This paper addresses the lack of a unified formalization for multimedia recommendation by proposing a multimodal deep-learning–inspired schema that structures the problem into multimodal input data, feature processing, and multimodal fusion. It formalizes the components, applies them to four representative models, and integrates the framework into the Elliot benchmarking platform to systematically compare six multimodal recommenders against four classical baselines using both accuracy and beyond-accuracy metrics. The benchmarking reveals that certain multimodal models achieve strong accuracy (e.g., LATTICE, BM3, FREEDOM), while beyond-accuracy measures often favor approaches like GRCN that balance novelty, diversity, and popularity bias. The work also discusses technical challenges such as missing modalities and pre-trained feature limitations, and outlines future directions toward domain-specific features and more extensive, fair evaluations with standardized protocols and datasets.

Abstract

Recommender systems (RSs) offer personalized navigation experiences on online platforms, but recommendation remains a challenging task, particularly in specific scenarios and domains. Multimodality can help tap into richer information sources and construct more refined user/item profiles for recommendations. However, existing literature lacks a shared and universal schema for modeling and solving the recommendation problem through the lens of multimodality. This work aims to formalize a general multimodal schema for multimedia recommendation. It provides a comprehensive literature review of multimodal approaches for multimedia recommendation from the last eight years, outlines the theoretical foundations of a multimodal pipeline, and demonstrates its rationale by applying it to selected state-of-the-art approaches. The work also conducts a benchmarking analysis of recent algorithms for multimedia recommendation within Elliot, a rigorous framework for evaluating recommender systems. The main aim is to provide guidelines for designing and implementing the next generation of multimodal approaches in multimedia recommendation.

Paper Structure (49 sections, 19 equations, 3 figures, 7 tables, 1 algorithm)

Introduction
Literature Review (RQ1)
Which modalities?
How to process modalities?
When to fuse modalities?
Similar works to this paper
A formal multimodal schema for multimedia recommendation (RQ2)
Classical recommendation task
Multimodal input data
Multimodal feature processing
Feature extraction
Multimodal representation
Multimodal feature fusion
Multimodal recommendation task
Implementation and benchmarking (RQ3)
...and 34 more sections

Figures (3)

Figure 1: Our multimodal schema for multimedia recommendation. After (1) a modality-aware feature extraction, the extracted features may be either directly represented into a unique latent space (2a) or projected into a different latent space for each modality (2b). While in the former case, the multimodal representation is used to produce a prediction (4), in the latter case, all modalities must undergo a fusion phase (3). In the early fusion (3a), we produce a final representation that is used for prediction (4). Otherwise, we first produce a different prediction for each modality (4), and then we fuse them (late fusion) into a single predicted value (3b).
Figure 2: A visual representation of Joint and Coordinate multimodal representation (above and below, respectively).
Figure 3: An example of how users generate and upload multimodal feedback about the items they interacted with (e.g., textual reviews, product photos, or even video reviews) on online platforms. Such user-item sources of information may be suitably exploited to better profile users' preferences (DBLP:conf/cikm/AnelliDNSFMP22).
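
The difference between early fusion (3a) and late fusion (3b) in the Figure 1 caption can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the concatenation used for early fusion, the weighted sum used for late fusion, and the dot-product predictor are all assumptions chosen for brevity, and the embeddings are toy values rather than extracted features.

```python
from typing import Dict, List, Optional

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product, used here as a stand-in for the prediction step (4)."""
    return sum(x * y for x, y in zip(a, b))


def early_fusion_score(user: Dict[str, Vector], item: Dict[str, Vector]) -> float:
    """Early fusion (3a): fuse the per-modality embeddings, then predict once.

    Fusion is plain concatenation (an assumption; other fusion
    functions could be substituted here).
    """
    modalities = sorted(item)  # fixed order so user and item vectors align
    fused_user = [x for m in modalities for x in user[m]]
    fused_item = [x for m in modalities for x in item[m]]
    return dot(fused_user, fused_item)


def late_fusion_score(
    user: Dict[str, Vector],
    item: Dict[str, Vector],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Late fusion (3b): predict once per modality, then fuse the predictions.

    Predictions are fused by a weighted sum; equal weights are assumed
    when none are given.
    """
    modalities = sorted(item)
    if weights is None:
        weights = {m: 1.0 / len(modalities) for m in modalities}
    return sum(weights[m] * dot(user[m], item[m]) for m in modalities)


# Toy per-modality embeddings (hypothetical values).
user = {"visual": [1.0, 0.0], "textual": [0.5, 0.5]}
item = {"visual": [2.0, 1.0], "textual": [1.0, 1.0]}

print(early_fusion_score(user, item))
print(late_fusion_score(user, item))
print(late_fusion_score(user, item, {"visual": 0.75, "textual": 0.25}))
```

With concatenation and a dot-product predictor, the early-fusion score equals the unweighted sum of the per-modality scores, so the two strategies only diverge once the fusion function or the predictor becomes non-linear or the late-fusion weights are unequal.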
