Attention-based sequential recommendation system using multimodal data
Hyungtaik Oh, Wonkeun Jo, Dongil Kim
TL;DR
This work tackles the challenge of leveraging multimodal item data in sequential recommendation by introducing Multimodal Attention Fusion (MAF), which applies independent attention to ID and multimodal features (images via VGG, texts via BERT, and categories) and fuses them for next-item prediction. The model uses fixed-length sequences, explicit multimodal embeddings, and multitask losses to improve generalization across modalities. Empirical results on four Amazon datasets show that incorporating multimodal data consistently enhances performance, with modality contributions varying by dataset size and characteristics. The approach also provides attention weight visualizations that reveal how sequence and multimodal cues are integrated, though it incurs a higher computational cost, which the authors propose to address in future work.
Abstract
Sequential recommendation systems that model dynamic preferences based on a user's past behavior are crucial to e-commerce. Recent studies on these systems have considered various types of information such as images and texts. However, multimodal data have not yet been utilized directly to recommend products to users. In this study, we propose an attention-based sequential recommendation method that employs multimodal item data such as images, texts, and categories. First, we extract image and text features using pre-trained VGG and BERT models and convert categories into multi-label form. Subsequently, attention operations are performed independently on the item sequence and on the multimodal representations. Finally, the individual attention outputs are integrated through an attention fusion function. In addition, we apply a multitask learning loss for each modality to improve generalization performance. Experimental results on the Amazon datasets show that the proposed method outperforms conventional sequential recommendation systems.
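The pipeline described in the abstract (independent attention per modality, followed by an attention fusion step) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `self_attention` and `fuse`, the causal mask, the absence of learned projection matrices, the use of the last sequence position, and the random `query` vector are all assumptions made for illustration, and random arrays stand in for the ID, VGG, BERT, and category features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(seq):
    """Scaled dot-product self-attention over one modality's sequence.

    Assumption: a causal mask (each step attends only to itself and
    earlier steps) and no learned projections.
    """
    n, d = seq.shape
    scores = seq @ seq.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)
    weights = softmax(scores, axis=-1)
    return weights @ seq, weights

def fuse(outputs, query):
    """Hypothetical attention fusion: weight each modality's last-step
    output by its scaled dot-product similarity to a query vector."""
    stacked = np.stack([o[-1] for o in outputs])          # (modalities, dim)
    alpha = softmax(stacked @ query / np.sqrt(stacked.shape[1]))
    return alpha @ stacked, alpha

rng = np.random.default_rng(0)
seq_len, dim = 5, 8  # fixed-length sequence; sizes are illustrative

# Random placeholders for per-item ID, image, text, and category embeddings.
modalities = {name: rng.normal(size=(seq_len, dim))
              for name in ("id", "image", "text", "category")}

# Attention is applied to each modality independently.
attended = {name: self_attention(x)[0] for name, x in modalities.items()}

# The per-modality results are then combined into one user representation.
query = rng.normal(size=dim)
fused, alpha = fuse(list(attended.values()), query)
```

In this sketch, `alpha` plays the role of per-modality fusion weights, and `fused` is the vector that would be scored against candidate item embeddings for next-item prediction. A multitask setup in the spirit of the abstract would add one prediction loss per modality output alongside the loss on `fused`.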
