Movie Recommendation with Poster Attention via Multi-modal Transformer Feature Fusion

Linhan Xia, Yicheng Yang, Ziou Chen, Zheng Yang, Shengxin Zhu

TL;DR

This work addresses the sparsity of user-item interactions in movie recommendations by fusing text descriptions and poster imagery through a multi-modal Transformer. It deploys BERT for textual features, ViT for poster features, and a Transformer-based fusion to predict user ratings, evaluated on MovieLens 100K and 1M with RMSE improvements over traditional and single-modal baselines. Key contributions include an end-to-end cross-modal framework with CLS-based representations and token-level fusion that demonstrates tangible gains from incorporating poster information. The findings underscore the value of integrating visual poster cues with narrative text in scalable, multi-modal recommender systems for improved personalization.

Abstract

Pre-trained models learn general representations from large datasets and can be fine-tuned for specific tasks, which significantly reduces training time. Pre-trained models such as generative pre-trained transformers (GPT), bidirectional encoder representations from transformers (BERT), and vision transformers (ViT) have become a cornerstone of current research in machine learning. This study proposes a multi-modal movie recommendation system that extracts features from the well-designed poster of each movie and from the movie's narrative text description. The system uses the BERT model to extract information from the text modality, the ViT model to extract information from the poster/image modality, and the Transformer architecture to fuse the features of all modalities and predict users' preferences. Integrating pre-trained foundation models with smaller datasets in downstream applications captures multi-modal content features more comprehensively, thereby providing more accurate recommendations. The efficiency of the proof-of-concept model is verified on the standard MovieLens 100K and 1M benchmark datasets. The prediction accuracy of user ratings improves over the baseline algorithm, demonstrating the potential of this cross-modal algorithm for movie or video recommendation.
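
The TL;DR reports results as RMSE (root-mean-square error) on predicted user ratings. As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function name `rmse` and the sample ratings are illustrative assumptions, not values or code from the paper.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each per-rating error, average, then take the square root.
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings on the 1-5 MovieLens scale, for illustration only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 2.5]
print(round(rmse(predicted, actual), 4))  # prints 0.6124
```

A lower RMSE means the predicted ratings sit closer to the observed ones, so an "RMSE improvement" over a baseline is a decrease in this value.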

Paper Structure

This paper contains 16 sections, 15 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Pipeline of the proposed model, where ViT is the Vision Transformer and BERT is Bidirectional Encoder Representations from Transformers. Both are pre-trained models downloaded from Hugging Face.
  • Figure 2: Feature extraction process of the BERT model in our research; the dimension of BERT's output is 768.
  • Figure 3: Working process of the proposed image feature extraction method, where the transformation converts all images to a standard 224 by 224 square.
  • Figure 4: Framework of the proposed feature fusion method, where the $[CLS]$ and $[SEP]$ tokens mark the beginning and the end of the sequence.
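
The captions for Figures 2 to 4 describe text and poster features being placed in one sequence bounded by $[CLS]$ and $[SEP]$ and then fused by a Transformer. The sketch below illustrates that idea in plain Python under strong simplifying assumptions: toy 4-dimensional vectors stand in for the 768-dimensional features, a single attention head is used, and the query/key/value projections are the identity. The function names (`softmax`, `self_attention`, `fuse`) and the exact ordering of tokens inside the sequence are hypothetical; this is not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections, so each token attends
    to the raw token vectors. A real Transformer layer learns these.
    """
    d = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in tokens
        ]
        weights = softmax(scores)
        # Weighted sum of all token vectors, one output per query token.
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(d)]
        )
    return outputs

def fuse(text_tokens, image_tokens, cls_vec, sep_vec):
    """Build [CLS] + text + image + [SEP] and return the attended [CLS] vector."""
    sequence = [cls_vec] + text_tokens + image_tokens + [sep_vec]
    return self_attention(sequence)[0]

# Toy stand-ins for BERT (text) and ViT (poster) features.
text_tokens = [[1.0, 0.0, 0.0, 0.0]]
image_tokens = [[0.0, 1.0, 0.0, 0.0]]
cls_vec = [0.0, 0.0, 1.0, 0.0]
sep_vec = [0.0, 0.0, 0.0, 1.0]
print(fuse(text_tokens, image_tokens, cls_vec, sep_vec))
```

In a full model, the fused $[CLS]$ vector would typically feed a small prediction head that outputs the rating; that head is omitted here because the summary does not specify it.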