Understanding Long Videos with Multimodal Language Models

Kanchana Ranasinghe; Xiang Li; Kumara Kahatapitiya; Michael S. Ryoo

TL;DR

This work examines whether long-video understanding in LLM-based systems relies more on the LLM's world knowledge than on information drawn from the video itself. It shows that modality-constrained baselines can perform strongly with minimal video data, then introduces MVU, which extracts three object-centric modalities from video and fuses them through natural language prompts to an LLM, achieving state-of-the-art zero-shot results on EgoSchema, NExT-QA, and robotics benchmarks. MVU uses off-the-shelf vision tools to obtain Global Object Information (GOI), Object Spatial Location (OSL), and Object Motion Trajectory (OMT), enabling interpretable, efficient multimodal fusion without video-level training. The study includes extensive ablations and analyses that demonstrate the value of each modality and the effectiveness of likelihood-based selection for fast, reliable multiple-choice question (MCQ) answering.

Abstract

Large Language Models (LLMs) have allowed recent LLM-based approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how the extensive world knowledge and strong reasoning skills of the underlying LLMs influence this performance. We discover that LLM-based approaches can yield surprisingly good accuracy on long-video tasks with limited video information, sometimes even with no video-specific information. Building on this, we explore injecting video-specific information into an LLM-based framework. We utilize off-the-shelf vision tools to extract three object-centric information modalities from videos, and then leverage natural language as a medium for fusing this information. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across multiple video understanding benchmarks. Strong performance on robotics domain tasks further establishes its generality. Code: https://github.com/kahnchana/mvu
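The abstract describes fusing three object-centric modalities through natural language. As a rough illustration only, the sketch below renders hypothetical GOI, OSL, and OMT outputs into a single text prompt for a multiple-choice question. The function name `build_fusion_prompt`, the prompt template, and the input formats (object names, normalized (x, y) centers, lists of (x, y) positions) are all assumptions made for this example and are not taken from the paper or its code.

```python
def build_fusion_prompt(question, options, goi, osl, omt):
    """Render object-centric information as text (hypothetical template).

    goi: list of object names (Global Object Information).
    osl: dict of name -> (x, y) normalized center (Object Spatial Location).
    omt: dict of name -> list of (x, y) positions (Object Motion Trajectory).
    These formats are assumptions for illustration only.
    """
    lines = ["Objects seen in the video: " + ", ".join(goi) + "."]
    lines.append("Object locations (normalized x, y):")
    for name, (x, y) in osl.items():
        lines.append(f"- {name} at ({x:.2f}, {y:.2f})")
    lines.append("Object motion trajectories:")
    for name, path in omt.items():
        steps = " -> ".join(f"({x:.2f}, {y:.2f})" for x, y in path)
        lines.append(f"- {name}: {steps}")
    lines.append(f"Question: {question}")
    for i, option in enumerate(options):
        # Label options (A), (B), (C), ...
        lines.append(f"({chr(ord('A') + i)}) {option}")
    lines.append("Answer:")
    return "\n".join(lines)


# Toy inputs standing in for the outputs of off-the-shelf vision tools.
prompt = build_fusion_prompt(
    question="What does the person do with the cup?",
    options=["Washes it", "Throws it away"],
    goi=["cup", "sink"],
    osl={"cup": (0.40, 0.55), "sink": (0.50, 0.70)},
    omt={"cup": [(0.20, 0.50), (0.40, 0.55)]},
)
print(prompt)
```

Because the fused input is plain text, each modality can be inspected or dropped independently, which is one way to read the interpretability claim in the TL;DR.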

Paper Structure (35 sections, 4 equations, 5 figures, 19 tables)

Introduction
Related Work
Naive Baselines & Likelihood Selection
Problem Formulation
Likelihood Selection
Modality Constrained Variants
Multimodal Video Understanding Framework
Vision Tools for Video Analysis
Object-Centric Information Modalities
Language based Fusion
Experiments
Long Video Question Answering
Robotics Domain
Ablations
...and 20 more sections

Figures (5)

Figure 1: Overview of Framework: We propose three variants of our framework that solve complex long-video question-answering tasks. (left-top) Just-LLM utilizes only world knowledge with zero task-specific awareness. (left-bottom) Single-Frame-VLM processes an additional center frame to obtain task context but accesses no video-specific information. (right) Our complete approach, MVU, extracts three additional object-centric information modalities followed by fusion in language space. LS refers to likelihood selection.
Figure 2: Likelihood Selection Workflow: We illustrate how the likelihood selection strategy adapted for video QnA tasks can be efficiently parallelized (i.e. calculated with a simple cross-entropy loss in one forward pass, followed by an argmin operation), in contrast to the setting of iteratively generating multiple tokens.
Figure 3: Overview of proposed framework for Multimodal Video Understanding, MVU.
Figure 4: Data Visualization: Example video frames from EgoSchema (top) vs OpenX (bottom) datasets. Robotics domain videos (bottom) appear out of distribution given their controlled environment and robot movements.
Figure A.1: MVU Robot Control Extension: We adapt MVU for robot manipulation (MVU-R) by framing control as a video question answering task, enabling zero-shot action prediction via vision-language prompting.
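The Figure 2 caption describes likelihood selection as a cross-entropy computation followed by an argmin. The sketch below illustrates that idea on toy data: each candidate answer is scored by the cross-entropy of its tokens under the model's predicted distributions, and the lowest-loss candidate is selected. The toy logits stand in for what a single batched LLM forward pass would produce, and averaging the loss over tokens is an assumption of this example; the function names are hypothetical and not taken from the paper's code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def answer_cross_entropy(step_logits, answer_token_ids):
    """Mean negative log-likelihood of the answer tokens.

    step_logits[t] is the logit vector used to predict answer token t.
    Averaging over tokens is an assumption made for this sketch.
    """
    assert len(step_logits) == len(answer_token_ids)
    nll = 0.0
    for logits, token_id in zip(step_logits, answer_token_ids):
        nll -= log_softmax(logits)[token_id]
    return nll / len(answer_token_ids)


def likelihood_select(candidates):
    """Return (index of lowest-loss candidate, list of all losses)."""
    losses = [answer_cross_entropy(logits, ids) for logits, ids in candidates]
    best = min(range(len(losses)), key=losses.__getitem__)  # argmin
    return best, losses


# Toy vocabulary of 4 tokens; the same logits score two candidate answers.
toy_logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]
candidates = [
    (toy_logits, [0, 1]),  # candidate A: tokens the model favors
    (toy_logits, [2, 3]),  # candidate B: tokens the model disfavors
]
best, losses = likelihood_select(candidates)
print(best, losses)
```

In a real setting the per-candidate losses would be computed together in one batch, which is what avoids generating tokens one at a time.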
