D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

Yiyang Huang, Yizhou Wang, Yun Fu

TL;DR

D-CoDe is a training-free framework for adapting image-pretrained vision-language models to video. It targets two challenges: the perception bottleneck and token overload. Dynamic compression (adaptive frame selection plus pruning and merging of spatial tokens, guided by semantic content) addresses the first. Question decomposition, which reformulates a query into focused sub-questions, addresses the second and lets the model make better use of its visual tokens. Across multiple VideoQA benchmarks, D-CoDe delivers strong gains, notably surpassing some training-required methods on EgoSchema and achieving top open-ended results on both short- and long-form videos. The work offers a practical, scalable path for applying image-based VLMs to diverse video understanding tasks, while acknowledging limitations in highly dynamic scenes and suggesting future enhancements such as slow-fast integration and memory-augmented temporal reasoning.

Abstract

Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.

Paper Structure

This paper contains 30 sections, 13 equations, 8 figures, and 18 tables.

Figures (8)

  • Figure 1: Adapting image-pretrained VLMs to video faces two major challenges: the perception bottleneck, in which salient information is unevenly distributed across spatial and temporal dimensions, limiting the effectiveness of static compression in preserving key visual cues; and token overload, where video inputs yield substantially more visual tokens than images, exceeding the model's capacity for comprehensive understanding.
  • Figure 2: (a) Static compression treats all content uniformly, discarding informative cues that are dynamically distributed across temporal and spatial dimensions, thereby limiting fine-grained perception. In contrast, dynamic compression better preserves key visual cues across both dimensions. (b) As the number of input tokens increases, the accuracy of the baseline saturates, indicating limited utility of excessive tokens. In contrast, question decomposition consistently expands the accuracy gap, demonstrating its ability to more effectively leverage large token inputs.
  • Figure 3: The D-CoDe pipeline consists of two components: dynamic compression and question decomposition. Dynamic compression augments temporal uniform sampling by selecting supplementary frames to retain informative segments, then discards uninformative spatial tokens and merges semantically similar ones to reduce redundancy while preserving essential visual information. Question decomposition reformulates complex queries into sub-questions, guiding the model to attend to diverse aspects of the video and enabling comprehensive understanding.
  • Figure 4: To mitigate the temporal perception bottleneck, that is, to avoid missing informative video content, supplementary frames are selected based on their semantic dissimilarity to uniformly sampled ones, where similarity is measured using global features extracted by the CLIP visual encoder.
  • Figure 5: To mitigate the spatial perception bottleneck, spatial tokens are first pruned based on their $\ell_2$ activation magnitudes. The remaining informative tokens are then grouped according to cosine similarity and aggregated via mean pooling, thereby reducing redundancy while preserving semantic fidelity.
  • ...and 3 more figures
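
The two dynamic-compression steps described in the captions for Figures 4 and 5 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`select_frames`, `compress_tokens`), the parameters (`num_uniform`, `num_extra`, `keep_ratio`, `sim_threshold`), the greedy selection rule, the choice to drop the lowest-norm tokens, and the threshold-based grouping are all assumptions made for illustration. Frame features are plain arrays here; in the paper they come from the CLIP visual encoder.

```python
import numpy as np


def _unit(x):
    # L2-normalise along the last axis so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)


def select_frames(frame_feats, num_uniform, num_extra):
    """Uniform sampling plus supplementary frames (hypothetical greedy rule).

    frame_feats: (T, D) array, one global feature per frame.
    Returns sorted frame indices.
    """
    num_frames = len(frame_feats)
    feats = _unit(frame_feats)
    # Start from uniformly spaced frames.
    uniform = np.linspace(0, num_frames - 1, num_uniform).round().astype(int)
    chosen = sorted(set(uniform.tolist()))
    for _ in range(num_extra):
        # Similarity of every frame to its most similar already-chosen frame.
        sim = (feats @ feats[chosen].T).max(axis=1)
        sim[chosen] = np.inf  # never pick a frame twice
        best = int(sim.argmin())  # most dissimilar remaining frame
        if not np.isfinite(sim[best]):
            break  # every frame is already selected
        chosen.append(best)
    return sorted(chosen)


def compress_tokens(tokens, keep_ratio=0.5, sim_threshold=0.9):
    """Prune by L2 magnitude, then merge similar tokens by mean pooling.

    tokens: (N, D) array of spatial tokens for one frame.
    Assumption: low-magnitude tokens are the uninformative ones.
    """
    norms = np.linalg.norm(tokens, axis=1)
    n_keep = max(1, int(round(keep_ratio * len(tokens))))
    # Keep the n_keep largest-norm tokens, preserving their original order.
    keep = np.sort(np.argsort(-norms)[:n_keep])
    kept = tokens[keep]
    unit = _unit(kept)
    groups = []  # each group is a list of indices; the first index is its anchor
    for i in range(len(kept)):
        for group in groups:
            if unit[i] @ unit[group[0]] >= sim_threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # no similar group found: start a new one
    # One mean-pooled token per group.
    return np.stack([kept[group].mean(axis=0) for group in groups])


# Toy usage: frame 2 differs from all others, so it is picked as the extra frame.
frames = np.array([[1, 0], [1, 0], [0, 1], [1, 0], [1, 0], [1, 0]], dtype=float)
picked = select_frames(frames, num_uniform=2, num_extra=1)

# Toy usage: the tiny token is pruned and the two near-parallel tokens are merged.
toks = np.array([[2.0, 0.0], [2.2, 0.0], [0.0, 3.0], [0.01, 0.01]])
merged = compress_tokens(toks, keep_ratio=0.75, sim_threshold=0.9)
```

In this sketch, grouping compares each token only with the first member of each group, which keeps the loop simple but makes the result depend on token order. A clustering step over the full similarity matrix would be an alternative.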