VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models

Fufangchen Zhao, Liao Zhang, Daiqi Shi, Yuanjun Gao, Chen Ye, Yang Cai, Jian Gao, Danfeng Yan

TL;DR

VideoPerceiver tackles the challenge of fine-grained temporal perception in video-language models by introducing a two-stage training framework that combines key-information-absent video construction, intermediate-layer contrastive learning, and a comparative reinforcement learning objective based on GRPO. A dedicated 80K video dataset supports SFT and RL training, enabling the model to recover temporally precise action details in both short clips and long videos. Empirical results show state-of-the-art performance on fine-grained action understanding (MotionBench) and transient event perception (VRU-Accident), while maintaining strong generalization on standard video-language benchmarks. The work advances video-language modeling by prioritizing task-relevant visual features and providing a principled approach to learning fine-grained temporal semantics.

Abstract

We propose VideoPerceiver, a novel video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding, addressing VMLLMs' limited ability to reason about brief actions in short clips or rare transient events in long videos. VideoPerceiver adopts a two-stage training framework. During supervised fine-tuning (SFT), we construct "key-information-missing" videos by extracting event-action keywords from captions, identifying corresponding key frames, and replacing them with adjacent frames. We jointly encode original and modified video tokens with text tokens, aligning intermediate visual representations with keywords via an auxiliary contrastive loss to enhance sensitivity to fine-grained motion cues. In reinforcement learning (RL), both video variants are fed into the model to generate descriptions, and a novel relative reward ensures responses from complete videos outperform those from degraded inputs, explicitly training the model to recover temporally precise action details. We also curate a dataset of 80,000 videos with fine-grained actions and transient events. Experiments show VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks, while maintaining strong performance on standard tasks. By prioritizing task-relevant visual features, our work redefines video-language model training for fine-grained perception.
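
The "key-information-missing" construction described in the abstract can be illustrated with a minimal sketch. It assumes that key-frame indices have already been identified from the caption keywords (that step is not shown), and that "adjacent" means the nearest frame that is not itself a key frame. The function name `replace_key_frames` and the tie-breaking rule are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def replace_key_frames(frames: Sequence[Frame], key_indices: Sequence[int]) -> List[Frame]:
    """Return a copy of `frames` where each key frame is overwritten by its
    nearest non-key frame (assumption: ties go to the earlier frame)."""
    key = set(key_indices)
    for i in key:
        if not 0 <= i < len(frames):
            raise IndexError(f"key index {i} is out of range")
    # Candidate replacement frames: everything that is not a key frame.
    non_key = [i for i in range(len(frames)) if i not in key]
    if not non_key:
        raise ValueError("at least one non-key frame is required")
    degraded = list(frames)
    for i in key:
        # Pick the closest non-key index; the second tuple element breaks ties.
        nearest = min(non_key, key=lambda j: (abs(j - i), j))
        degraded[i] = frames[nearest]
    return degraded


# Toy example: strings stand in for decoded frames.
original = ["f0", "f1", "f2", "f3", "f4", "f5"]
degraded = replace_key_frames(original, key_indices=[2, 3])
print(degraded)  # ['f0', 'f1', 'f1', 'f4', 'f4', 'f5']
```

The degraded clip keeps the original length and frame rate, so the only difference between the two variants is the missing key action.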

Paper Structure

This paper contains 20 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The illustration of the proposed VideoPerceiver.
  • Figure 2: Key-information-absent video construction (left) and special frame sampling strategy for rare transient events (right).
  • Figure 3: Schematic diagram of contrastive learning in the intermediate layer.
  • Figure 4: Schematic diagram of Comparative GRPO.
  • Figure 5: Summary of dataset distribution. a) Composition of the dataset; sources that each account for less than 1% are merged under the "Other" label. b) Duration distribution of key events in the MM-AU dataset, which is used to train transient event understanding. Most key events are short and account for less than 15% of the total video duration, consistent with how transient events occur in real life.
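
The relative reward idea from the abstract (responses generated from the complete video should outperform those generated from the degraded video) can be sketched as follows. This is a hedged illustration only: the summary does not give the reward formula, so the difference-with-margin form, the `margin` parameter, and both function names are assumptions. The group normalization step follows the general GRPO recipe of standardizing rewards within a sampled group; using the population standard deviation here is also an assumption.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def relative_rewards(full_scores: Sequence[float],
                     degraded_scores: Sequence[float],
                     margin: float = 0.0) -> List[float]:
    """Hypothetical relative reward: how much better the response from the
    complete video scores than the response from the degraded video."""
    if len(full_scores) != len(degraded_scores):
        raise ValueError("score lists must have the same length")
    return [f - d - margin for f, d in zip(full_scores, degraded_scores)]


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy scores for three sampled response pairs (made-up numbers).
full = [0.9, 0.6, 0.7]
degraded = [0.4, 0.5, 0.7]
rewards = relative_rewards(full, degraded)
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy group, the first pair receives the largest advantage because its complete-video response improves most over the degraded one, while the third pair, where both variants score the same, receives the smallest.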