
Learning Video Context as Interleaved Multimodal Sequences

Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou

TL;DR

This paper introduces MovieSeq, a multimodal language model for understanding video contexts. Instead of relying on video alone, it takes character photos together with their names and dialogues as input, allowing the model to associate these elements and generate more comprehensive responses.

Abstract

Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identifying who appears, how characters relate, and why events happen). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as interleaved multimodal sequences (including images, plots, videos, and subtitles), either by linking external knowledge databases or by using offline models (such as Whisper for subtitles). Through instruction-tuning, this approach enables the language model to interact with videos using interleaved multimodal instructions. For example, instead of relying solely on video as input, we jointly provide character photos alongside their names and dialogues, allowing the model to associate these elements and generate more comprehensive responses. To demonstrate its effectiveness, we validate MovieSeq's performance on six datasets (LVU, MAD, MovieNet, CMD, TVC, MovieQA) across five settings (video classification, audio description, video-text retrieval, video captioning, and video question-answering). The code will be made public at https://github.com/showlab/MovieSeq.
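To make the idea of an interleaved multimodal sequence concrete, the following is a minimal sketch of how character photos, names, a video clip, and subtitles might be arranged into one ordered instruction. It is not the authors' implementation: the `Segment`, `Character`, and `SubtitleLine` types, the two helper functions, the ordering of the segments, and the `<image>`/`<video>` placeholder tokens are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional


@dataclass
class Segment:
    """One element of the interleaved sequence (hypothetical type)."""
    kind: Literal["text", "image", "video"]
    content: str  # the text itself, or a path/identifier for an image or video


@dataclass
class Character:
    """A character photo paired with a name, e.g. from an external database."""
    name: str
    photo: str


@dataclass
class SubtitleLine:
    """One subtitle line, e.g. produced by an offline speech model."""
    speaker: Optional[str]
    text: str


def build_interleaved_sequence(
    video: str,
    characters: List[Character],
    subtitles: List[SubtitleLine],
    question: str,
) -> List[Segment]:
    """Arrange photos, names, the video, subtitles, and a question in order."""
    seq: List[Segment] = []
    # Each photo is immediately followed by its name so the two can be associated.
    for character in characters:
        seq.append(Segment("image", character.photo))
        seq.append(Segment("text", f"This is {character.name}."))
    seq.append(Segment("video", video))
    # Subtitles follow the video; the speaker name is prepended when known.
    for line in subtitles:
        prefix = f"{line.speaker}: " if line.speaker else ""
        seq.append(Segment("text", prefix + line.text))
    seq.append(Segment("text", question))
    return seq


def render_placeholder_prompt(seq: List[Segment]) -> str:
    """Render the sequence as text, with placeholder tokens for visual inputs."""
    parts = [s.content if s.kind == "text" else f"<{s.kind}>" for s in seq]
    return " ".join(parts)


if __name__ == "__main__":
    sequence = build_interleaved_sequence(
        video="clip.mp4",
        characters=[Character("Alice", "alice.jpg")],
        subtitles=[SubtitleLine("Alice", "Hello.")],
        question="Who is speaking?",
    )
    print(render_placeholder_prompt(sequence))
```

In a real model, each placeholder would be replaced by visual embeddings from an image or video encoder before the sequence reaches the language model; the sketch only shows the ordering of the elements.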

Paper Structure

This paper contains 13 sections, 1 equation, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: MovieSeq aims to address diverse challenges in understanding video contexts by enabling flexible interleaved multimodal instructions, such as Video+Images (for character identification), Video+Subtitles (for dialogue understanding), Video+Plots (for external knowledge via RAG), and Video+History (for event dependency).
  • Figure 2: Comparison between different video-language input modes. (a) Single video input, e.g., LLaVA, MiniGPT-4, VideoChat. (b) In-context input, e.g., Flamingo, Otter, which showcases examples for structured few-shot learning. (c) Our approach, which uses flexible contexts (e.g., external character images, dialogues, etc.) and associates them to produce a comprehensive response.
  • Figure 3: Illustration of the MovieSeq pipeline. First, we represent the input video as an interleaved multimodal sequence (such as images, plots, videos, or subtitles), either by linking to an external database or by leveraging annotations from offline models (Whisper, WhisperX). Then, we create an interleaved instruction (which can be a combination of the above contexts) and feed it into the language model. The language model is trained to associate these elements and generate a comprehensive response.
  • Figure 5: Visualization of MovieSeq by providing different kinds of interleaved multimodal prompts for different applications.
  • Figure : (a) Different numbers of history clips under different instruction templates (iv).
  • ...and 2 more figures