Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models

Yuxiao Chen, Jue Wang, Zhikang Zhang, Jingru Yi, Xu Zhang, Yang Zou, Zhaowei Cai, Jianbo Yuan, Xinyu Li, Hao Yang, Davide Modolo

TL;DR

This paper tackles the challenge of long-form video understanding under token-budget constraints in large multimodal models. It introduces an information-density-based Adaptive Video Sampler (AVS) and an autoencoder-based Spatiotemporal Video Compressor (SVC) that are trained end-to-end with an MLLM, achieving a $64\times$ reduction in visual tokens. The approach yields strong results on both long-form and standard video benchmarks, outperforming several state-of-the-art methods while using far fewer tokens. This work significantly enhances the practicality of applying LLMs to hours-long videos, particularly in resource-constrained settings.

Abstract

With recent advancements in video backbone architectures, combined with the remarkable achievements of large language models (LLMs), the analysis of long-form videos spanning tens of minutes has become both feasible and increasingly prevalent. However, the inherently redundant nature of video sequences poses significant challenges for contemporary state-of-the-art models. These challenges stem from two primary aspects: 1) efficiently incorporating a larger number of frames within memory constraints, and 2) extracting discriminative information from the vast volume of input data. In this paper, we introduce a novel end-to-end schema for long-form video understanding, which includes an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) integrated with a multimodal large language model (MLLM). Our proposed system offers two major advantages: it adaptively and effectively captures essential information from video sequences of varying durations, and it achieves high compression rates while preserving crucial discriminative information. The proposed framework demonstrates promising performance across various benchmarks, excelling in both long-form video understanding tasks and standard video understanding benchmarks. These results underscore the versatility and efficacy of our approach, particularly in managing the complexities of prolonged video sequences.
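To make the memory constraint concrete, the following is a minimal sketch of the token-budget arithmetic behind a 64× reduction in visual tokens. It is not code from the paper: the function name `frames_within_budget`, the 8192-token budget, and the 256 tokens per frame are hypothetical values chosen only for illustration.

```python
def frames_within_budget(token_budget: int, tokens_per_frame: int, compression: int = 1) -> int:
    """Return how many frames fit in a visual-token budget.

    Assumes (for illustration only) that every frame costs the same number
    of tokens and that compression divides the total token count evenly.
    """
    if min(token_budget, tokens_per_frame, compression) <= 0:
        raise ValueError("all arguments must be positive integers")
    # frames * tokens_per_frame / compression <= token_budget
    return (token_budget * compression) // tokens_per_frame


# Hypothetical numbers, not taken from the paper.
budget = 8192      # visual tokens the language model can accept
per_frame = 256    # tokens produced per frame before compression

print(frames_within_budget(budget, per_frame))       # 32
print(frames_within_budget(budget, per_frame, 64))   # 2048
```

Under these assumed numbers, the same budget covers 64 times as many frames, which is why compression matters more as video duration grows.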

Paper Structure (22 sections, 11 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: An overview of our method and previous works, showing different ways of modeling long-form video with an LLM: (a) interpreting each clip as a clip-level caption and aggregating the captions via an LLM in the linguistic space; (b) uniformly sampling video frames and leveraging paired text to compress video tokens; and (c) our proposed method, which leverages an adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC).
  • Figure 2: Overview of the proposed method.
  • Figure 3: Examples of sampled frames using uniform sampling (top) compared to our AVS (bottom). AVS successfully locates the key frame to answer the question.
  • Figure 4: Examples of sampled frames using uniform sampling (top row) compared to our AVS (bottom row). AVS successfully locates the key frame to answer the question.
  • Figure 5: Examples of sampled frames using uniform sampling (top row) compared to our AVS (bottom row). AVS successfully locates the key frame to answer the question.