IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

Yatai Ji, Shilong Zhang, Jie Wu, Peize Sun, Weifeng Chen, Xuefeng Xiao, Sidi Yang, Yujiu Yang, Ping Luo

TL;DR

The paper tackles the challenge of memorizing and recognizing multiple character identities across scenes for movie understanding. It introduces IDA-VLM, an ID-aware LVLM built with dual-stage visual instruction tuning using ID references and a specialized ID-Former, and MM-ID, a benchmark with four progressive tasks to measure cross-scene identity memory. In extensive experiments, IDA-VLM achieves state-of-the-art performance on MM-ID, and the results reveal limitations of existing LVLMs in identity-centric reasoning. The work advances multi-identity visual understanding and lays groundwork for AI systems capable of following complex visual narratives like films.

Abstract

The rapid advancement of Large Vision-Language Models (LVLMs) has demonstrated a spectrum of emergent capabilities. Nevertheless, current models focus only on the visual content of a single scenario, and their ability to associate instances across different scenes has not yet been explored, even though it is essential for understanding complex visual content such as movies with multiple characters and intricate plots. Towards movie understanding, a critical initial step for LVLMs is to unlock the ability to memorize and recognize character identities across multiple visual scenarios. To achieve this goal, we propose visual instruction tuning with ID reference and develop an ID-Aware Large Vision-Language Model, IDA-VLM. Furthermore, our research introduces a novel benchmark, MM-ID, to examine LVLMs on instance ID memory and recognition across four dimensions: matching, location, question-answering, and captioning. Our findings highlight the limitations of existing LVLMs in recognizing and associating instance identities with ID reference. This paper paves the way for future artificial intelligence systems to handle multi-identity visual inputs, thereby facilitating the comprehension of complex visual narratives like movies.

Paper Structure (23 sections, 9 figures, 12 tables)

This paper contains 23 sections, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Comparison of different Visual Instruction Tuning Formats. For visual instruction tuning with ID reference, we arrange the names and images of each character as references. The model should be able to recognize the correct character identity and then answer the user's instructions.
  • Figure 2: IDA-VLM is an end-to-end vision-language model for processing instructions that contain references to specific instances. An ID reference pairs a character reference image with the corresponding name to characterize an identity, for example: Julia is <ID-Img{i}>. During tokenization and conversion to embeddings, the embeddings of <ID-Img{i}> and <Test-Img{i}> in the instruction are replaced with the ID and test image embeddings, respectively. A simple yet effective image feature projector, termed ID-Former, is proposed to enhance the ID identification ability. As the output in the figure shows, IDA-VLM can memorize these character IDs, recognize them in test images, and respond to user instructions with the correct ID references.
  • Figure 3: First-stage and second-stage instruction tuning data construction pipelines.
  • Figure 4: Data samples of MM-ID. Each sample consists of ID images of each character and test images containing multiple characters. Our model can memorize the identity information in the ID references and generalize to recognize characters from different scenes.
  • Figure 5: We present visualizations of selected samples from MM-ID, corresponding to Q&A and caption sub-tasks. We showcase responses of GPT-4V, Gemini-pro and ours (IDA-VLM).
  • ...and 4 more figures
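
The ID-reference format described in the captions of Figures 1 and 2 can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the concrete placeholder spelling (`<ID-Img0>`, `<Test-Img0>`), the ordering of references before test images, and the one-dimensional toy embeddings are all assumptions made for illustration, and the ID-Former projector is not modeled. The sketch only shows the general idea of pairing each character name with an image placeholder and then substituting image embeddings at the placeholder positions.

```python
import re
from typing import Callable, Dict, List, Sequence

Vector = List[float]

# Matches placeholders such as <ID-Img0> or <Test-Img3>.
# The exact spelling and indexing are assumptions for this sketch.
PLACEHOLDER = re.compile(r"(<(?:ID|Test)-Img\d+>)")


def build_instruction(names: Sequence[str], num_test_images: int, question: str) -> str:
    """Pair each character name with an ID-image placeholder, then append
    the test-image placeholders and the user's question."""
    id_refs = " ".join(f"{name} is <ID-Img{i}>." for i, name in enumerate(names))
    test_refs = " ".join(f"<Test-Img{j}>" for j in range(num_test_images))
    return f"{id_refs} {test_refs} {question}"


def splice_embeddings(
    instruction: str,
    embed_text: Callable[[str], List[Vector]],
    image_embeddings: Dict[str, List[Vector]],
) -> List[Vector]:
    """Return one embedding sequence in which every placeholder is replaced
    by the embeddings of the image it stands for."""
    sequence: List[Vector] = []
    # re.split with a capturing group keeps the placeholders in the result.
    for piece in PLACEHOLDER.split(instruction):
        if PLACEHOLDER.fullmatch(piece):
            # Raises KeyError if an image is missing for a placeholder.
            sequence.extend(image_embeddings[piece])
        elif piece.strip():
            sequence.extend(embed_text(piece))
    return sequence


def toy_embed_text(text: str) -> List[Vector]:
    """Hypothetical stand-in for a text embedder: one 1-d vector per word."""
    return [[float(len(word))] for word in text.split()]


instruction = build_instruction(["Julia", "Tom"], 1, "Who is on the left?")
# Hypothetical image features: two vectors per image, tagged by sign.
image_embeddings = {
    "<ID-Img0>": [[-1.0], [-1.0]],
    "<ID-Img1>": [[-2.0], [-2.0]],
    "<Test-Img0>": [[-3.0], [-3.0]],
}
sequence = splice_embeddings(instruction, toy_embed_text, image_embeddings)
print(instruction)
print(len(sequence))
```

In a real model the text embedder and image features would come from the language model and a visual encoder with a projector; here they are replaced by toy functions so the substitution step can be followed in isolation.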