FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding

Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, Zuxuan Wu

TL;DR

To adapt effectively to dynamic scenes, FluxMem introduces a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning.

Abstract

This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.

Paper Structure (25 sections, 5 equations, 2 figures, 5 tables, 1 algorithm)

Figures (2)

  • Figure 1: Overview of FluxMem: Adaptive Hierarchical Memory. Each incoming frame is encoded into visual tokens and written to FluxMem in a cascaded short–mid–long process. On short-term memory overflow, Temporal Adjacency Selection ($\mathrm{TAS}$) retains temporally variant tokens for mid-term memory; on mid-term memory overflow, Spatial Domain Consolidation ($\mathrm{SDC}$) merges spatially redundant regions into compact anchors for long-term memory. The overflow process is guided by distribution-adaptive thresholds, autonomously calibrating retention strength to the video's temporal dynamics. Notably, the similarity metric against the preceding frame, required for $\mathrm{TAS}$, is computed upon the token's entry into the short-term memory, enabling it to serve as a zero-overhead trigger for active LLM output.
  • Figure 2: Ablation of FluxMem. (a) Method comparison across drop ratios on the MLVU dataset. (b) and (c) Comparison of our adaptive and fixed thresholds in the mid- and long-term memory banks. The cosine distance of each token is compared against these thresholds to determine whether it is kept or dropped. The shaded area presents the distribution of average per-video drop ratios for the adaptive and optimal fixed thresholds, aggregated across all videos in the MLVU benchmark.
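
The cascaded short, mid, and long-term write path described in the Figure 1 caption can be sketched as follows. This is a minimal illustration of the overflow control flow only, under several assumptions that are not from the paper: capacities are counted in frames, one frame is evicted per overflow, and the two compression stages are passed in as plain callables. The class name `CascadedMemory` and the stand-in compressors in the usage example are hypothetical.

```python
from collections import deque


class CascadedMemory:
    """Toy three-level memory: short -> mid -> long, compressed on overflow."""

    def __init__(self, short_cap, mid_cap, temporal_select, spatial_merge):
        # ASSUMPTION: capacities are measured in frames, not tokens.
        self.short_cap = short_cap
        self.mid_cap = mid_cap
        self.temporal_select = temporal_select  # stand-in for the TAS stage
        self.spatial_merge = spatial_merge      # stand-in for the SDC stage
        self.short = deque()  # uncompressed recent frames
        self.mid = deque()    # frames after temporal selection
        self.long = []        # frames after spatial consolidation

    def write(self, frame_tokens):
        """Append a frame and push overflow down the hierarchy."""
        self.short.append(frame_tokens)
        if len(self.short) > self.short_cap:
            oldest = self.short.popleft()
            self.mid.append(self.temporal_select(oldest))
        if len(self.mid) > self.mid_cap:
            oldest = self.mid.popleft()
            self.long.append(self.spatial_merge(oldest))

    def token_count(self):
        """Total number of tokens held across all three levels."""
        levels = (self.short, self.mid, self.long)
        return sum(len(frame) for level in levels for frame in level)


# Usage with placeholder compressors (NOT the paper's TAS or SDC):
# keep every second token, then collapse a frame to its first token.
mem = CascadedMemory(
    short_cap=2,
    mid_cap=1,
    temporal_select=lambda tokens: tokens[::2],
    spatial_merge=lambda tokens: tokens[:1],
)
for f in range(4):
    mem.write([f * 10 + i for i in range(4)])  # 4 integer "tokens" per frame
print(len(mem.short), list(mem.mid), mem.long)
```

The point of the structure is that recent frames stay at full resolution while older ones are compressed progressively harder, so the total token count grows much more slowly than the number of frames seen.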