Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

Jinlong Li, Liyuan Jiang, Haonan Zhang, Nicu Sebe

TL;DR

AOT is a training-free token reduction method for video LLMs. It establishes local- and global-aware token anchors within each frame under attention guidance, uses optimal transport to aggregate informative context from pruned tokens into these intra-frame anchors, and then treats the first frame of each temporal clip as keyframe anchors that absorb similar information from consecutive frames.

Abstract

Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy or prune inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. These approaches also often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that establishes token Anchors at the intra-frame and inter-frame levels to comprehensively aggregate informative context via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative context from pruned tokens into these anchors, constructing intra-frame token anchors. Next, building on temporal frame clips, the first frame within each clip serves as the keyframe anchors, which gather similar information from consecutive frames through optimal transport while distinct tokens are kept to represent temporal dynamics. This yields efficient token reduction in a training-free manner. Extensive evaluations show that AOT achieves competitive performance across various short- and long-video benchmarks on leading video LLMs, with substantial computational savings while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT
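The intra-frame step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: anchor selection is simplified to top-k attention scores, the marginals are assumed uniform, and the function names (`cosine_cost`, `sinkhorn`, `aggregate_into_anchors`), the entropic regularization `eps`, and the update rule `anchor + lam * context` are all assumptions made for illustration. The sketch uses standard Sinkhorn iterations to compute a transport plan from pruned tokens to anchors, then folds the transported context into each anchor.

```python
import math

def cosine_cost(x, y):
    """Cosine distance 1 - cos(x, y); the small constant guards against zero norms."""
    dot = sum(p * q for p, q in zip(x, y))
    nx = math.sqrt(sum(p * p for p in x))
    ny = math.sqrt(sum(q * q for q in y))
    return 1.0 - dot / (nx * ny + 1e-12)

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT via Sinkhorn iterations.

    Returns a plan P whose row sums approach `a` and column sums match `b`.
    """
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def aggregate_into_anchors(tokens, attn, k, lam=0.5):
    """Keep the k highest-attention tokens as anchors (assumed rule, needs k < len(tokens))
    and add OT-weighted context from the pruned tokens to each anchor."""
    order = sorted(range(len(tokens)), key=lambda i: attn[i], reverse=True)
    anchors = [tokens[i] for i in sorted(order[:k])]
    pruned = [tokens[i] for i in sorted(order[k:])]
    cost = [[cosine_cost(p, q) for q in anchors] for p in pruned]
    a = [1.0 / len(pruned)] * len(pruned)  # uniform mass on pruned tokens (assumption)
    b = [1.0 / k] * k                      # uniform mass on anchors (assumption)
    plan = sinkhorn(cost, a, b)
    merged = []
    for j, anchor in enumerate(anchors):
        col = sum(plan[i][j] for i in range(len(pruned)))
        # Context for anchor j: plan-weighted average of the pruned tokens.
        ctx = [sum(plan[i][j] / col * pruned[i][d] for i in range(len(pruned)))
               for d in range(len(anchor))]
        merged.append([x + lam * c for x, c in zip(anchor, ctx)])
    return merged, plan

# Toy frame: four 2-D tokens with hypothetical attention scores.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
attn = [0.4, 0.1, 0.3, 0.2]
merged, plan = aggregate_into_anchors(tokens, attn, k=2)
```

In this toy frame, tokens 0 and 2 become anchors, and each pruned token sends most of its mass to the anchor it resembles, so the frame shrinks from four tokens to two without simply discarding the pruned ones. Here `lam` stands in for the weighting coefficient that controls the contextual contribution.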

Paper Structure (23 sections, 30 equations, 6 figures, 8 tables, 2 algorithms)

Figures (6)

  • Figure 1: Top: the essential difference from common token reduction methods. Instead of simply removing unimportant tokens or merging very similar ones, our method uses a global optimization strategy to further exploit and aggregate the necessary semantics and context from those tokens onto the remaining tokens. Bottom: our proposed pipeline, which adopts optimal transport to aggregate information at the intra- and inter-frame levels for video tokens.
  • Figure 2: Overall pipeline of AOT. Our method compresses the tokens of video LLMs across the spatial and temporal dimensions through optimal transport. It first establishes token anchors within each frame to cover semantically important and spatially diverse token candidates, then uses optimal transport to aggregate the necessary informative cues within each frame (Phase I, intra-frame), and finally shifts the optimization to the temporal dimension across frames (Phase II, inter-frame). AOT preserves both temporal and visual integrity and uses efficient Sinkhorn-Knopp iteration to solve for the optimal transport plan.
  • Figure 3: Left: scaling to more frames leads to more efficient and effective visual information abstraction. Right: sensitivity analysis of the weighting coefficients $\lambda_{intra}$ and $\lambda_{inter}$, which control the contextual contribution, under a consistent configuration.
  • Figure 4: Qualitative visualization of how our local-global token anchors evolve across consecutive frames, with optimal transport aggregating the necessary information from unselected tokens to help the LLM process the input better.
  • Figure 5: Qualitative visualization of how our local-global token anchors evolve across consecutive frames on an MVBench sample, with optimal transport aggregating the necessary information from unselected tokens to help the LLM process the input better. Top: the original sampled frames. Bottom: the corresponding token visualization.
  • ...and 1 more figures
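
The inter-frame phase in the Figure 2 caption (Phase II) can be sketched at a high level as follows. This is an illustrative assumption, not the paper's procedure: the clip length, the cosine-similarity threshold `sim_threshold`, and the helper names `split_clips` and `partition_clip` are hypothetical. The sketch only shows the bookkeeping, where the first frame of each clip acts as the keyframe and later tokens are split into those similar to a keyframe token (candidates for optimal-transport aggregation into the keyframe anchors) and distinct ones that are kept to represent temporal dynamics. The optimal transport step itself is omitted.

```python
import math

def cosine(x, y):
    """Cosine similarity; the small constant guards against zero norms."""
    dot = sum(p * q for p, q in zip(x, y))
    nx = math.sqrt(sum(p * p for p in x))
    ny = math.sqrt(sum(q * q for q in y))
    return dot / (nx * ny + 1e-12)

def split_clips(frames, clip_len):
    """Group consecutive frames into clips of at most clip_len frames."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]

def partition_clip(clip, sim_threshold=0.9):
    """Split the non-keyframe tokens of one clip into redundant and distinct sets.

    Returns the keyframe tokens plus two lists of (frame_offset, token_index).
    The threshold rule is an assumption made for illustration.
    """
    keyframe = clip[0]
    redundant, distinct = [], []
    for t, frame in enumerate(clip[1:], start=1):
        for idx, tok in enumerate(frame):
            best = max(cosine(tok, key_tok) for key_tok in keyframe)
            if best >= sim_threshold:
                redundant.append((t, idx))  # would be aggregated into keyframe anchors
            else:
                distinct.append((t, idx))   # kept to represent temporal dynamics
    return keyframe, redundant, distinct

# Toy video: four frames, two 2-D tokens per frame (hypothetical values).
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.99, 0.05], [0.7, 0.7]],
    [[1.0, 1.0], [1.0, -1.0]],
    [[1.0, 1.02], [-1.0, -1.0]],
]
clips = split_clips(frames, clip_len=2)
keyframe, redundant, distinct = partition_clip(clips[0])
```

In the first toy clip, the token that nearly repeats a keyframe token is marked redundant, while the token pointing in a new direction is kept as distinct.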