Clapper: Compact Learning and Video Representation in VLMs

Lingyu Kong, Hongzhi Zhang, Jingyuan Zhang, Jianzhao Huang, Kunze Li, Qi Wang, Fuzheng Zhang

TL;DR

Clapper tackles temporal modeling and token efficiency in video Vision-Language Models by introducing a slow-fast video representation and a TimePerceiver module that compresses visual input to an average of 61 tokens per frame, a 13x reduction. It uses a pretrained image encoder to capture high-resolution keyframe details and TimePerceiver to encode temporal dynamics, enabling compact learning within a VLM backbone such as Qwen2. Evaluations on six video QA benchmarks show competitive performance under fixed token budgets, notably in the >4x compression regime, while keeping token overhead practical. The work also presents a two-stage training regimen and a fair evaluation protocol under a fixed upper bound on visual tokens, highlighting practical considerations for deploying video VLMs. Limitations include training at 1 fps, length extrapolation constraints, and the absence of RLHF alignment, which suggest avenues for further optimization and longer-context capabilities.

Abstract

Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e., capturing dependencies across frames) and balancing the processing of short and long videos. Specifically, short videos demand preservation of fine-grained details, whereas long videos require strategic compression of visual information to handle extensive temporal contexts efficiently. However, our empirical analysis reveals a critical limitation: most existing VLMs suffer severe performance degradation on long video understanding tasks when visual tokens are compressed to below a quarter of their original count. To enable more effective modeling of both short and long video inputs, we propose Clapper, a method that uses a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding within existing VLM backbones. With our method, we achieve 13x compression of visual tokens per frame (averaging 61 tokens/frame) without compromising QA accuracy. In our experiments, Clapper achieves 62.0% on VideoMME, 69.8% on MLVU, and 67.4% on TempCompass, all with fewer than 6,000 visual tokens per video. The code will be publicly available on the homepage.
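
The token figures in the abstract can be related with some back-of-the-envelope arithmetic. The sketch below is only illustrative: the three constants come from the abstract, while the function names are hypothetical, and the derived values are approximate because the reported 13x and 61 tokens/frame are rounded averages.

```python
# Figures quoted in the abstract.
TOKENS_PER_FRAME = 61    # average visual tokens per frame after compression
COMPRESSION_RATIO = 13   # reported per-frame compression factor
TOKEN_BUDGET = 6000      # "fewer than 6,000 visual tokens per video"


def implied_uncompressed_tokens(tokens_per_frame: int, ratio: int) -> int:
    """Approximate per-frame token count before compression (hypothetical helper)."""
    return tokens_per_frame * ratio


def max_frames(budget: int, tokens_per_frame: int) -> int:
    """Largest whole number of frames that fits within a visual token budget."""
    return budget // tokens_per_frame


uncompressed = implied_uncompressed_tokens(TOKENS_PER_FRAME, COMPRESSION_RATIO)
print(uncompressed)                                # 793
print(max_frames(TOKEN_BUDGET, TOKENS_PER_FRAME))  # 98 frames when compressed
print(max_frames(TOKEN_BUDGET, uncompressed))      # 7 frames when uncompressed
```

Under these rounded numbers, the same budget covers roughly 98 compressed frames versus about 7 uncompressed ones, which is why per-frame compression matters most for long videos.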

Paper Structure

This paper contains 18 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Architecture of Clapper. The model consists of a vision encoder, a TimePerceiver module, an MLP layer, and an LLM. Input videos are sampled and divided into segments, with each segment represented by a combination of high-resolution keyframe embedding and compressed temporal embedding.
  • Figure 2: The TimePerceiver module processes 2-4 frames to generate a fixed number of temporal embedding outputs (49 in this paper).
  • Figure 3: Performance of Clapper with different numbers of input frames on the eight video QA benchmarks.
  • Figure 4: Comparison of video captioning results from Clapper and other models. Key points are displayed in bold. Details captured only by Clapper and absent from the other models' outputs are highlighted in bold green.
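
The Figure 2 caption says that TimePerceiver maps a segment of 2-4 frames to a fixed set of 49 temporal embeddings. One common way to obtain a fixed-size output from a variable-length input is Perceiver-style cross-attention from a set of learned queries; whether TimePerceiver works exactly this way is not stated here, so the following NumPy sketch is an assumption for illustration only. The embedding width, the per-frame token count, the random weights, and the name `compress_segment` are all hypothetical; only the value 49 and the 2-4 frame range come from the caption.

```python
import numpy as np

NUM_LATENTS = 49        # fixed output length, from the Figure 2 caption
DIM = 64                # hypothetical embedding width
TOKENS_PER_FRAME = 16   # hypothetical per-frame token count, kept small

rng = np.random.default_rng(0)
# Stand-ins for learned parameters (randomly initialised here).
latents = rng.standard_normal((NUM_LATENTS, DIM))
w_q = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)
w_k = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)
w_v = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_segment(frame_tokens: np.ndarray) -> np.ndarray:
    """Map (num_frames, tokens_per_frame, DIM) to (NUM_LATENTS, DIM).

    The latent queries attend over every token of every frame in the
    segment, so the output length does not depend on the frame count.
    """
    num_frames, tokens_per_frame, dim = frame_tokens.shape
    flat = frame_tokens.reshape(num_frames * tokens_per_frame, dim)
    q = latents @ w_q                                 # (49, DIM)
    k = flat @ w_k                                    # (N, DIM)
    v = flat @ w_v                                    # (N, DIM)
    attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)   # (49, N)
    return attn @ v                                   # (49, DIM)


for n in (2, 3, 4):
    segment = rng.standard_normal((n, TOKENS_PER_FRAME, DIM))
    print(n, compress_segment(segment).shape)  # shape is (49, 64) each time
```

The point of the sketch is the shape behaviour: the number of output embeddings is set by the number of latent queries, not by how many frames the segment contains.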