Unraveling Movie Genres through Cross-Attention Fusion of Bi-Modal Synergy of Poster
Utsav Kumar Nareti, Chandranath Adak, Soumi Chattopadhyay, Pichao Wang
TL;DR
This work addresses multi-label movie genre classification from posters by leveraging both visual content and OCR-extracted text. It introduces a cross-attention fusion framework with a Multi-head Cross Attention Module (MCAM) and a Sequential Multi-head Self Attention Module (SMSAM) to integrate poster imagery and textual cues, using CLIP-based features and an OCR pipeline. Training employs an Asymmetric Loss (ASL) to handle label imbalance, and post-processing with a 0.5 threshold yields binary genre predictions. Evaluated on an IMDb dataset of 13882 posters spanning 13 genres, the approach outperforms baselines and state-of-the-art methods, highlighting the value of poster text for pre-release genre tagging and offering a robust, scalable solution for downstream recommendation and search tasks.
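As a rough illustration of the training loss and post-processing mentioned above, the following is a minimal sketch of an asymmetric loss for a single sample, followed by thresholding at 0.5. The hyperparameters `gamma_pos`, `gamma_neg`, and `margin` are assumptions (commonly used defaults for ASL), not values reported in this summary, and treating a score of exactly 0.5 as positive is likewise an assumption.

```python
import math

def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Asymmetric loss for one sample.

    probs   : per-genre sigmoid outputs in [0, 1]
    targets : per-genre 0/1 labels
    gamma_pos, gamma_neg, margin are assumed values, not taken from the paper.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: focal-style weighting with exponent gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative term: shift the probability down by the margin so that
            # very easy negatives contribute nothing, then down-weight the rest.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total

def binarize(probs, threshold=0.5):
    """Turn per-genre scores into 0/1 predictions (>= threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in probs]

# Toy example with three hypothetical genre scores.
scores = [0.9, 0.2, 0.6]
labels = [1, 0, 0]
loss = asymmetric_loss(scores, labels)
predictions = binarize(scores)
```

The larger exponent on the negative term reduces the influence of the many easy negative labels, which is the usual motivation for applying this loss to imbalanced multi-label data.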
Abstract
Movie posters are not just decorative; they are meticulously designed to capture the essence of a movie, such as its genre, storyline, and tone. For decades, movie posters have graced cinema walls and billboards, and they now appear on our screens in digital form. Movie genre classification plays a pivotal role in film marketing, audience engagement, and recommendation systems. Previous work on movie genre classification has mostly relied on plot summaries, subtitles, trailers, and movie scenes. Movie posters, in contrast, provide a tantalizing pre-release glimpse into a film's key aspects, which can ignite public interest. In this paper, we present a framework that exploits movie posters from both a visual and a textual perspective to address the multi-label movie genre classification problem. First, we extract text from movie posters using OCR and obtain the corresponding text embedding. Next, we introduce a cross-attention-based fusion module that allocates attention weights to the visual and textual embeddings. To validate our framework, we use 13882 posters sourced from the Internet Movie Database (IMDb). The experimental results indicate that our model achieves promising performance and outperforms some prominent contemporary architectures.
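To make the idea of allocating attention weights across modalities concrete, here is a simplified single-head sketch of scaled dot-product cross-attention, in which visual tokens act as queries and text tokens supply keys and values. It is not the paper's MCAM: learned projections and multiple heads are omitted, and the token vectors below are made-up toy values rather than CLIP or OCR features.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality
    and keys/values from the other.

    Returns (outputs, weights); weights[i][j] is how much query i
    attends to key j.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output channel at a time.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Hypothetical toy embeddings: 2 visual tokens, 3 text tokens, dimension 2.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused, attn = cross_attention(visual_tokens, text_tokens, text_tokens)
```

Each row of `attn` sums to 1, so every visual token receives a convex combination of the text token vectors. Swapping the roles of the two inputs would give text-to-visual attention instead.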
