Attention-based sequential recommendation system using multimodal data

Hyungtaik Oh, Wonkeun Jo, Dongil Kim

TL;DR

This work tackles the challenge of leveraging multimodal item data in sequential recommendation by introducing Multimodal Attention Fusion (MAF), which applies independent attention to ID and multimodal features (images via VGG, texts via BERT, and categories) and fuses them for next-item prediction. The model uses fixed-length sequences, explicit multimodal embeddings, and multitask losses to improve generalization across modalities. Empirical results on four Amazon datasets show that incorporating multimodal data consistently enhances performance, with modality contributions varying by dataset size and characteristics. The approach also provides attention weight visualizations that reveal how sequence and multimodal cues are integrated, though it incurs a higher computational cost, which the authors propose to address in future work.

Abstract

Sequential recommendation systems that model dynamic preferences based on a user's past behavior are crucial to e-commerce. Recent studies on these systems have considered various types of information such as images and texts. However, multimodal data have not yet been utilized directly to recommend products to users. In this study, we propose an attention-based sequential recommendation method that employs multimodal data of items such as images, texts, and categories. First, we extract image and text features from pre-trained VGG and BERT and convert categories into multi-labeled forms. Subsequently, attention operations are performed independently on the item sequence and the multimodal representations. Finally, the individual attention information is integrated through an attention fusion function. In addition, we apply a multitask learning loss for each modality to improve the generalization performance. The experimental results obtained from the Amazon datasets show that the proposed method outperforms conventional sequential recommendation systems.
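
The pipeline described in the abstract (multi-label category encoding, independent attention per modality, then attention-based fusion) can be illustrated with a minimal sketch. This is not the authors' implementation. Random vectors stand in for the VGG and BERT features, the attention has no learned projections, and the helper names (multi_hot, project, causal_self_attention, attention_fusion) as well as the choice of the ID representation as the fusion query are assumptions made only for illustration.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multi_hot(item_categories, vocabulary):
    """Encode an item's category labels as a 0/1 vector over `vocabulary`."""
    return [1.0 if c in item_categories else 0.0 for c in vocabulary]


def project(vec, matrix):
    """Map `vec` into the shared dimension with a (len(vec) x d) matrix."""
    d_out = len(matrix[0])
    return [sum(v * row[k] for v, row in zip(vec, matrix)) for k in range(d_out)]


def causal_self_attention(seq):
    """Scaled dot-product self-attention; position t sees positions <= t."""
    d = len(seq[0])
    out = []
    for t, query in enumerate(seq):
        scores = [dot(query, seq[s]) / math.sqrt(d) for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(w * seq[s][k] for s, w in enumerate(weights))
                    for k in range(d)])
    return out


def attention_fusion(per_modality, query):
    """Hypothetical fusion: softmax-weighted sum of one vector per modality."""
    names = list(per_modality)
    d = len(query)
    scores = [dot(query, per_modality[n]) / math.sqrt(d) for n in names]
    weights = softmax(scores)
    fused = [sum(w * per_modality[n][k] for n, w in zip(names, weights))
             for k in range(d)]
    return fused, dict(zip(names, weights))


random.seed(0)
d, seq_len = 4, 3  # shared feature size and sequence length (toy values)


def random_seq():
    # Stand-in for pre-extracted features (e.g. VGG or BERT outputs).
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(seq_len)]


vocab = ["action", "rpg", "console", "accessory"]  # made-up category labels
cats = [{"action", "console"}, {"rpg"}, {"accessory", "console"}]
cat_proj = [[random.gauss(0, 1) for _ in range(d)] for _ in vocab]

features = {
    "id": random_seq(),
    "image": random_seq(),
    "text": random_seq(),
    "category": [project(multi_hot(c, vocab), cat_proj) for c in cats],
}

# 1) Attention is applied to each modality's sequence independently.
attended = {name: causal_self_attention(seq) for name, seq in features.items()}

# 2) The last position of each modality is fused into one representation.
last = {name: out[-1] for name, out in attended.items()}
fused, weights = attention_fusion(last, query=last["id"])
print({name: round(w, 3) for name, w in weights.items()})
```

The printed fusion weights sum to one and indicate how much each modality contributes to the fused vector in this toy setting. In a trained model the features would pass through learned projections, and a separate prediction loss per modality could be added to the main loss in the spirit of the multitask objective mentioned above.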

Paper Structure (19 sections, 11 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: A sequence of items purchased by the user, which contains multimodal data such as images, text, and categories.
  • Figure 2: Structure of the proposed model: 1) feature extraction with pre-trained VGG, BERT, etc., 2) attention operations on the multimodal representations, and 3) multi-task learning for better generalization.
  • Figure 3: Comparison of multimodal data performance across the four datasets.
  • Figure 4: Video Games attention weight visualization.