Dataset and Models for Item Recommendation Using Multi-Modal User Interactions

Simone Borg Bruun, Krisztian Balog, Maria Maistro

TL;DR

This work tackles learning item recommendations from multi-modal user interactions across a website and a call center, where modalities are naturally missing because not every user engages through every channel. It introduces a real-world insurance dataset containing web sessions, conversation transcripts, and purchases, and develops three joint-representation models (Keyword, Latent Feature, and Relative Representation) that map heterogeneous interactions into a shared feature space, alongside strong baselines. Experimental results show that the shared-space models outperform uni-modal baselines: Latent Feature and Relative Representation offer robust gains, and Keyword effectively captures cross-modal interactions, which points to complementary information across modalities. The study provides public resources to support further research on learning personalized recommendations from diverse user interactions in high-stakes domains.

Abstract

While recommender systems with multi-modal item representations (image, audio, and text) have been widely explored, learning recommendations from multi-modal user interactions (e.g., clicks and speech) remains an open problem. We study the case of multi-modal user interactions in a setting where users engage with a service provider through multiple channels (website and call center). In such cases, incomplete modalities naturally occur, since not all users interact through all the available channels. To address these challenges, we publish a real-world dataset that allows progress in this under-researched area. We further present and benchmark various methods for leveraging multi-modal user interactions for item recommendations, and propose a novel approach that specifically deals with missing modalities by mapping user interactions to a common feature space. Our analysis reveals important interactions between the different modalities and shows that a frequently occurring modality can enhance learning from a less frequent one.

Paper Structure

This paper contains 27 sections, 11 equations, 9 figures, and 4 tables.

Figures (9)

  • Figure 1: Distribution of the users having conversations and web sessions.
  • Figure 2: Example of a user's past events. An event can either be a conversation, in the form of text, or a web session, in the form of action tags.
  • Figure 3: Schematic overview of the baseline models.
  • Figure 4: Schematic overview of our models.
  • Figure 5: Neural architectures of our models.
  • ...and 4 more figures