DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding

Yudong Han, Qingpei Guo, Liyuan Pan, Liu Liu, Yu Guan, Ming Yang

TL;DR

DynFocus tackles the memory bottleneck in LLM-based video understanding by revealing redundancy and question-dependent frame relevance in long videos. It introduces Dynamic Event Prototype Estimation (DPE) to select meaningful frames and Compact Cooperative Encoding (CCE) to encode important frames with fine-grained features (Cones) while summarizing others with coarse, text-guided tokens (Rods). The two-stage training regime aligns video content with language and then fine-tunes an LLM on instruction-following data, achieving competitive or superior results on short- and long-video benchmarks while using far fewer tokens. The approach also demonstrates robustness to video hallucination and shows clear efficiency advantages over state-of-the-art methods, making it attractive for scalable video-language applications. Overall, DynFocus provides a memory-efficient, dynamically adjustable framework that preserves crucial visual details and temporal cues essential for accurate video understanding in LLM-driven systems.

Abstract

The challenge in LLM-based video understanding lies in preserving visual and semantic information in long videos while maintaining a memory-affordable token count. However, redundancy and correspondence in videos have hindered the performance potential of existing methods. Through statistical learning on current datasets, we observe that redundancy occurs in both repeated and answer-irrelevant frames, and that the corresponding frames vary with different questions. This suggests the possibility of adopting dynamic encoding to balance detailed video information preservation with token budget reduction. To this end, we propose a dynamic cooperative network, DynFocus, for memory-efficient video encoding. Specifically, DynFocus comprises (i) a Dynamic Event Prototype Estimation (DPE) module that dynamically selects meaningful frames for question answering, and (ii) a Compact Cooperative Encoding (CCE) module that encodes meaningful frames with detailed visual appearance and the remaining frames with sketchy perception separately. We evaluate our method on five publicly available benchmarks, and experimental results consistently demonstrate that our method achieves competitive performance.

Paper Structure

This paper contains 34 sections, 13 equations, 9 figures, 17 tables.

Figures (9)

  • Figure 1: Concept of redundancy and correspondence in our pipeline. (a) The proportion of redundancy for video datasets. Redundancy includes both repeated and answer-irrelevant frames. Repetition measures the redundancy between consecutive frames, while answer-irrelevance refers to frames with a marginal contribution to question answering. (b) An example of correspondence. Given a video, we highlight the corresponding question/answer pairs and frames using red and blue boxes, respectively.
  • Figure 2: Schematic illustration of DynFocus. Our method takes the user instruction and video frames as input and yields compact video tokens from the CCE module for the LLM. Specifically, the DPE module serves as a selector that identifies the prototypes contributing most to the answer, providing the CCE module with the event prototypes $\{\mathbf{h}_{k}\}_{k=1}^{K}$ and the binary mask $\{b_{t}\}_{t=1}^{T}$, which are marked with two red arrows. Building on this, the CCE module dynamically encodes the critical prototypes with more tokens and encapsulates the marginal prototypes with few tokens. T-DPC and S-DPC denote DPC-KNN clustering applied temporally and spatially, respectively.
  • Figure 3: (a) and (b) illustrate the performance with different numbers of event prototypes and different ratios of filtered event prototypes, respectively.
  • Figure 4: Comparison of token counts across methods on different benchmark datasets. We compute each method's token count using its released video-loading code.
  • Figure 5: We show the filtered event prototypes that the DPE module focuses on for LV-Bench. To save space, we show only the prototypes with the top-6 scores, in sequence. The panel in the bottom-right corner illustrates the learned score distribution over the event prototype candidates ($L$=25) obtained by DPC-KNN.
  • ...and 4 more figures
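
The captions for Figures 2 and 5 refer to DPC-KNN clustering without spelling it out. The sketch below shows one common formulation of density-peaks clustering with k-nearest-neighbour density, applied to per-frame feature vectors. It is an illustration of the general technique under stated assumptions, not the authors' implementation: the function name `dpc_knn`, the parameter `k`, and the toy features are all hypothetical.

```python
import numpy as np


def dpc_knn(x, num_clusters, k=5):
    """Cluster rows of x (n, d) with a density-peaks + kNN heuristic.

    Returns (center_indices, labels). Hypothetical helper for illustration.
    """
    n = x.shape[0]
    # Pairwise Euclidean distances between all feature vectors.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Local density: high when the k nearest neighbours are close.
    k = min(k, n - 1)
    knn = np.sort(dist, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    rho = np.exp(-(knn ** 2).mean(axis=1))

    # Delta: distance to the nearest point that has a higher density.
    higher = rho[None, :] > rho[:, None]  # higher[i, j] is True if rho_j > rho_i
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()  # densest point(s) get the max distance

    # Cluster centres are points that are both dense and far from denser points.
    score = rho * delta
    centers = np.argsort(-score)[:num_clusters]

    # Assign every point to its nearest centre.
    labels = dist[:, centers].argmin(axis=1)
    labels[centers] = np.arange(len(centers))
    return centers, labels


# Toy usage: 20 "frame features" drawn from two well-separated groups.
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0, 0.1, (10, 4)), rng.normal(5, 0.1, (10, 4))])
centers, labels = dpc_knn(frames, num_clusters=2, k=5)
print(centers, labels)
```

In this reading, the selected centres play the role of event prototype candidates, and the labels group frames around them. How DynFocus scores and filters those candidates is described in the paper itself and is not reproduced here.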