A Multimodal Single-Branch Embedding Network for Recommendation in Cold-Start and Missing Modality Scenarios

Christian Ganhör, Marta Moscati, Anna Hausberger, Shah Nawaz, Markus Schedl

TL;DR

The paper addresses cold-start and missing-modality challenges in recommender systems (RSs) by introducing SiBraR, a multimodal single-branch embedding network that shares one encoder across modalities, mapping interaction data and multimodal side information into a unified embedding space. The framework uses a single deep function $g$ to embed any modality, computes recommendation scores as $\hat{y}_{ij} = \mathbf{e}_{i} \cdot \mathbf{e}_{j}$, and can be optimized with a Bayesian Personalized Ranking loss $\mathcal{L}_\text{BPR}$ and a symmetric InfoNCE contrastive loss $\mathcal{L}_\text{SInfoNCE}$ that aligns modalities. Experiments on Music4All-Onion, MovieLens ML-1M, and Amazon Video Games show that SiBraR significantly outperforms CF and state-of-the-art content-based RSs in item cold-start scenarios and remains competitive in warm-start settings, while visualization indicates a reduced modality gap in the shared embedding space. The results underscore SiBraR's practical value for multimodal recommendation when only some modalities are available and interactions are sparse. The work contributes a single-branch multimodal paradigm for RSs and demonstrates its effectiveness across multiple domains with diverse content modalities.
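The scoring and ranking loss mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the two-layer MLP standing in for $g$, the helper names (`init_branch`, `g`, `score`, `bpr_loss`), the free user embedding table, and the assumption that every modality has already been projected to a common input dimension are all hypothetical choices made for illustration. The BPR term uses its standard form, the mean of $-\ln \sigma(\hat{y}_{\text{pos}} - \hat{y}_{\text{neg}})$.

```python
import numpy as np

def init_branch(d_in, d_hidden, d_out, rng):
    """Weights of one small MLP standing in for the shared function g (assumed architecture)."""
    return {
        "W1": rng.normal(0.0, 0.1, size=(d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "W2": rng.normal(0.0, 0.1, size=(d_hidden, d_out)),
        "b2": np.zeros(d_out),
    }

def g(params, x):
    """Embed a batch of modality vectors with the single shared branch."""
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)  # ReLU hidden layer
    return h @ params["W2"] + params["b2"]

def score(e_user, e_item):
    """Dot-product score y_hat_ij = e_i . e_j, computed row-wise."""
    return np.sum(e_user * e_item, axis=-1)

def bpr_loss(e_user, e_pos, e_neg):
    """Standard BPR: mean of -log(sigmoid(y_pos - y_neg))."""
    diff = score(e_user, e_pos) - score(e_user, e_neg)
    # -log(sigmoid(d)) = log(1 + exp(-d)), evaluated stably via logaddexp
    return float(np.mean(np.logaddexp(0.0, -diff)))

rng = np.random.default_rng(0)
params = init_branch(d_in=16, d_hidden=32, d_out=8, rng=rng)

# Hypothetical toy batch: 4 users, each with one positive and one negative item.
e_user = rng.normal(size=(4, 8))       # assumed free user embedding table
pos_audio = rng.normal(size=(4, 16))   # positive items described by "audio" features
neg_text = rng.normal(size=(4, 16))    # negative items described by "text" features

# The same parameters embed both modalities (weight sharing).
e_pos = g(params, pos_audio)
e_neg = g(params, neg_text)
loss = bpr_loss(e_user, e_pos, e_neg)
```

Because one set of parameters serves every modality, an item with no interaction data can still be embedded from whichever content modality is available, which is the property the summary ties to cold-start robustness.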

Abstract

Most recommender systems adopt collaborative filtering (CF) and provide recommendations based on past collective interactions. Therefore, the performance of CF algorithms degrades when few or no interactions are available, a scenario referred to as cold-start. To address this issue, previous work relies on models leveraging both collaborative data and side information on the users or items. Similar to multimodal learning, these models aim at combining collaborative and content representations in a shared embedding space. In this work we propose a novel technique for multimodal recommendation, relying on a multimodal Single-Branch embedding network for Recommendation (SiBraR). Leveraging weight-sharing, SiBraR encodes interaction data as well as multimodal side information using the same single-branch embedding network on different modalities. This makes SiBraR effective in scenarios of missing modality, including cold start. Our extensive experiments on large-scale recommendation datasets from three different recommendation domains (music, movie, and e-commerce) and providing multimodal content information (audio, text, image, labels, and interactions) show that SiBraR significantly outperforms CF as well as state-of-the-art content-based RSs in cold-start scenarios, and is competitive in warm scenarios. We show that SiBraR's recommendations are accurate in missing modality scenarios, and that the model is able to map different modalities to the same region of the shared embedding space, hence reducing the modality gap.

Paper Structure (15 sections, 4 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Item-SiBraR model and training procedure. The SiBraR network represents the single-branch encoding network $g$ shared across modalities. For each user-item interaction pair $(u_i, i_j)$ in the training set, the recommendation loss $\mathcal{L}_\text{BPR}$ is computed between positive and negative items. The contrastive loss $\mathcal{L}_\text{SInfoNCE}$ is computed between two item modalities, over the set of items consisting of the positive item and its sampled negatives.
  • Figure 2: Performance of SiBraR on the test set of warm-start Onion for varying sets of input modalities. The integers at the bottom give the number of modalities used. In the central plot, a modality's block is filled with its color if that modality is used. The bar plot shows SiBraR's nDCG@10 for each set of modalities. The gray dashed horizontal lines show the nDCG@10 of the CF and CBRS algorithms.
  • Figure 3: t-SNE projected embeddings before and after SiBraR. While different modalities can be differentiated at SiBraR input (left), modalities overlap substantially in the shared embedding space after applying SiBraR (right).
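
The contrastive term named in the Figure 1 caption can be sketched in the same hedged spirit. The function below is a generic symmetric InfoNCE over two batches of embeddings of the same items, one batch per modality. The cosine normalization, the temperature value, and the names `log_softmax` and `symmetric_info_nce` are assumptions for illustration and may differ from the paper's exact formulation.

```python
import numpy as np

def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def symmetric_info_nce(e_a, e_b, temperature=0.1):
    """Symmetric InfoNCE between two modality embeddings of the same items.

    Row k of e_a and row k of e_b are assumed to describe the same item;
    every other row in the batch acts as a negative.
    """
    e_a = e_a / np.linalg.norm(e_a, axis=1, keepdims=True)  # assumed cosine similarity
    e_b = e_b / np.linalg.norm(e_b, axis=1, keepdims=True)
    logits = e_a @ e_b.T / temperature
    idx = np.arange(len(e_a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # modality a -> b
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # modality b -> a
    return float(0.5 * (loss_ab + loss_ba))

# Toy check: perfectly aligned modalities versus deliberately misaligned ones.
aligned = np.eye(4)
misaligned = np.roll(aligned, shift=1, axis=0)
loss_aligned = symmetric_info_nce(aligned, aligned)
loss_misaligned = symmetric_info_nce(aligned, misaligned)
```

Minimizing a loss of this kind pulls the embeddings of one item's different modalities together and pushes other items away, which is consistent with the overlap between modalities that Figure 3 reports after applying SiBraR.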