Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

Yang Shi, Jiaheng Liu, Yushuo Guan, Zhenhua Wu, Yuanxing Zhang, Zihao Wang, Weihong Lin, Jingyun Hua, Zekun Wang, Xinlong Chen, Bohan Zeng, Wentao Zhang, Fuzheng Zhang, Wenjing Yang, Di Zhang

TL;DR

Mavors tackles long-context video understanding in multimodal LLMs by introducing a multi-granularity video representation that preserves both spatial fidelity and temporal coherence. It employs an Intra-chunk Vision Encoder and an Inter-chunk Feature Aggregator to densely encode high-resolution video content into latent tokens while handling images as single-frame videos, all within a multi-stage training pipeline. The approach uses chunk-level rotary position encoding and a dynamic resolution strategy to handle variable video lengths and resolutions, achieving strong performance with efficient inference. Experiments and ablations demonstrate robust spatio-temporal reasoning, favorable cost-efficiency, and clear advantages over existing sampling and compression strategies.
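
To make the chunk-based view concrete, the following is a minimal sketch of how a frame sequence might be grouped into fixed-size chunks, with an image handled as a one-frame video. The function name `split_into_chunks`, the chunk size of 16, and the policy of padding a short chunk by repeating its last frame are all illustrative assumptions, not details taken from the paper; the sub-image decomposition mentioned in the abstract is not modeled here.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def split_into_chunks(frames: Sequence[Frame], chunk_size: int = 16) -> List[List[Frame]]:
    """Group consecutive frames into fixed-size chunks.

    Assumption (not from the paper): a short final chunk is padded by
    repeating its last frame so every chunk has the same length.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[List[Frame]] = []
    for start in range(0, len(frames), chunk_size):
        chunk = list(frames[start:start + chunk_size])
        # Pad the last chunk up to chunk_size with copies of its final frame.
        chunk.extend([chunk[-1]] * (chunk_size - len(chunk)))
        chunks.append(chunk)
    return chunks


# A 40-frame video becomes three chunks; the last one is padded.
video = [f"frame_{i}" for i in range(40)]
video_chunks = split_into_chunks(video)

# An image is treated as a video with a single frame, so it yields one chunk.
image = ["single_image"]
image_chunks = split_into_chunks(image)
```

Under these assumptions, images and videos reach the encoder in the same chunked shape, which is one simple way to share a single encoding path between the two.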

Abstract

Long-context video understanding in multimodal large language models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose $\mathbf{Mavors}$, a novel framework that introduces $\mathbf{M}$ulti-gr$\mathbf{a}$nularity $\mathbf{v}$ide$\mathbf{o}$ $\mathbf{r}$epre$\mathbf{s}$entation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.
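
The idea of chunk-level rotary position encoding can be illustrated with a small sketch. It applies the standard rotary formulation, but uses the chunk index as the position for every token in that chunk. The helper names (`rotate_by_position`, `chunk_level_rope`), the base of 10000, and the nested-list data layout are assumptions made for illustration; the paper's actual IFA implementation may differ.

```python
import math
from typing import List


def rotate_by_position(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Apply standard rotary position encoding to one even-length vector."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out: List[float] = []
    for i in range(0, dim, 2):
        # Each consecutive pair (i, i+1) is rotated by its own frequency.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def chunk_level_rope(chunks: List[List[List[float]]]) -> List[List[List[float]]]:
    """Rotate every token by the index of the chunk it belongs to.

    Assumption: all tokens of one chunk share one position, so position
    encodes order across chunks rather than order within a chunk.
    """
    return [
        [rotate_by_position(token, chunk_idx) for token in chunk]
        for chunk_idx, chunk in enumerate(chunks)
    ]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Two chunks, two 4-dimensional tokens each (toy values).
chunks = [
    [[1.0, 0.0, 0.5, 0.5], [0.2, 0.1, 0.0, 1.0]],
    [[0.3, 0.7, 1.0, 0.0], [0.9, 0.4, 0.6, 0.2]],
]
encoded = chunk_level_rope(chunks)
```

With this scheme, the attention score between two tokens depends only on the difference of their chunk indices, and tokens inside the same chunk keep their unrotated dot product, since both are rotated by the same angle.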

Paper Structure

This paper contains 23 sections, 8 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: (a) Sparse sampling, which retains high resolution but loses many details in the unsampled frames; (b) Dense sampling with low resolution, which covers a large number of frames but struggles with the low-resolution content; (c) Dense sampling with token compression, which keeps the key tokens on the main characters but suffers from hallucinations because visual tokens are missing; (d) Our Mavors, which balances the demands of resolution and number of frames. Although all of these approaches can perform similarly on Video-MME, Mavors significantly improves captioning on complex scenes. Words in red and green denote incorrect and correct details, respectively.
  • Figure 2: The impact of the number of frames (720P).
  • Figure 3: The impact of the resolution of frames (64 frames).
  • Figure 4: The architecture of Mavors.
  • Figure 5: The dynamic resolution strategy in Mavors.
  • ...and 9 more figures