FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding

Janghoon Cho, Jungsoo Lee, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi

TL;DR

Long video understanding is hampered by explosive visual token counts. The paper proposes FLoC, a training-free, model-agnostic, and query-agnostic token-compression method based on the submodular facility location function $f(S) = \sum_{v\in V} \max_{u\in S} \mathrm{sim}(v,u)$ with cosine similarity $\mathrm{sim}(v,u) = \frac{v^\top u}{\|v\|\|u\|}$, solved efficiently by a lazy greedy algorithm under a budget $|S|\le K$. The selected token subset is concatenated with text prompts and fed into video-LMMs, enabling efficient processing of long videos. Extensive experiments on Video-MME, MLVU, and LongVideoBench demonstrate that FLoC surpasses clustering- and pruning-based baselines in accuracy and speed, highlighting its practical impact for scalable long-video understanding in real-world applications.
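As a rough illustration of the selection step described above, the following Python sketch maximizes the facility location objective with a lazy greedy loop. It is not the authors' implementation: the function name `facility_location_lazy_greedy`, the dense $N \times N$ similarity matrix, and the choice to initialize each token's coverage at $-1$ (the lowest possible cosine similarity, so that every marginal gain is nonnegative) are assumptions made for this example.

```python
import heapq
import numpy as np

def facility_location_lazy_greedy(tokens, budget):
    """Return up to `budget` row indices of `tokens`, chosen greedily to
    maximize f(S) = sum_v max_{u in S} cos_sim(v, u).

    Hypothetical helper for illustration; not taken from the paper.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    n = tokens.shape[0]
    budget = min(budget, n)

    # Cosine similarity between all token pairs (dense N x N matrix).
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # coverage[v] = max_{u in S} sim(v, u). Assumption: start at -1, the
    # minimum cosine value, so gains are nonnegative from the first step.
    coverage = np.full(n, -1.0)

    # Max-heap (negated gains) of upper bounds on each marginal gain.
    heap = [(-(sim[:, u] - coverage).sum(), u) for u in range(n)]
    heapq.heapify(heap)

    selected = []
    while heap and len(selected) < budget:
        _, u = heapq.heappop(heap)
        # Recompute the true marginal gain of u for the current S.
        gain = np.maximum(sim[:, u] - coverage, 0.0).sum()
        # Submodularity: stale bounds only overestimate, so if the fresh
        # gain still beats the best remaining bound, u is the argmax.
        if not heap or gain >= -heap[0][0]:
            selected.append(u)
            coverage = np.maximum(coverage, sim[:, u])
        else:
            heapq.heappush(heap, (-gain, u))
    return selected
```

Calling `facility_location_lazy_greedy(visual_tokens, K)` on an $N \times d$ array returns the indices of the chosen tokens in selection order. Whether those tokens are then re-sorted into their original temporal order before being concatenated with the text prompt is not specified in this summary.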

Abstract

Recent studies in long video understanding have harnessed the advanced visual-language reasoning capabilities of Large Multimodal Models (LMMs), driving the evolution of video-LMMs specialized for processing extended video sequences. However, the scalability of these models is severely limited by the overwhelming volume of visual tokens generated from extended video sequences. To address this challenge, this paper proposes FLoC, an efficient visual token compression framework based on the facility location function, a principled approach that swiftly selects a compact yet highly representative and diverse subset of visual tokens within a predefined budget on the number of visual tokens. By integrating the lazy greedy algorithm, our method achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens, drastically reducing the number of visual tokens while guaranteeing near-optimal performance. Notably, our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution that seamlessly integrates with diverse video-LLMs and existing workflows. Extensive evaluations on large-scale benchmarks, such as Video-MME, MLVU, and LongVideoBench, demonstrate that our framework consistently surpasses recent compression techniques, highlighting not only its effectiveness and robustness in addressing the critical challenges of long video understanding, but also its efficiency in processing speed.
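The abstract does not spell out what "near-optimal" means. It presumably refers to the classical guarantee for greedy maximization of a monotone submodular function under a cardinality constraint (Nemhauser, Wolsey, and Fisher, 1978), which is stated here as standard background rather than as a quotation from the paper. A set function $f$ is submodular if marginal gains shrink as the set grows, which is also the property that makes lazy evaluation valid:

$$
f(A \cup \{u\}) - f(A) \;\ge\; f(B \cup \{u\}) - f(B) \qquad \text{for all } A \subseteq B \subseteq V,\; u \in V \setminus B .
$$

Assuming nonnegative similarities, so that $f$ is monotone with $f(\emptyset) = 0$, the greedy solution $S_{\mathrm{greedy}}$ with $|S_{\mathrm{greedy}}| \le K$ satisfies

$$
f(S_{\mathrm{greedy}}) \;\ge\; \left(1 - \frac{1}{e}\right) \max_{|S| \le K} f(S) \;\approx\; 0.632 \, \max_{|S| \le K} f(S).
$$

The bound concerns the value of the facility location objective, not the downstream accuracy of the video-LMM.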

Paper Structure

This paper contains 23 sections, 5 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: Performance (average relative accuracy compared to full token usage) versus compression time (log scale) for a number of compression algorithms. Details are described in the experiments section.
  • Figure 2: Overview of the proposed framework for selecting a visual token subset. Our method compresses the visual tokens extracted by a visual encoder from input video sequences into a diverse and representative subset within a given budget. The selected visual tokens are then concatenated with text tokens and fed into the video-LMM. Since our method is training-free and model-agnostic, it can be seamlessly integrated into any video-LMM in a plug-and-play manner.
  • Figure 3: Illustration of the proposed algorithm for selecting a subset of visual tokens using the lazy greedy approach. The process iteratively selects tokens with the highest marginal gain while ensuring diversity and representativeness within a given budget $K$. This figure demonstrates the execution of Algorithm 1 from line 7 to line 14 on a one-dimensional toy example.
  • Figure 4: t-SNE visualization of visual tokens. Red stars and black dots indicate the selected and discarded visual tokens, respectively. As shown, our method selects both representative and diverse visual tokens.
  • Figure 5: FLoC captures diverse visual tokens (e.g., hat, sunglasses) missed by DivPrune and TS-LLaVA, enabling accurate answers about what the woman is wearing.
  • ...and 5 more figures