STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding

Junho Kim, Hosu Lee, James M. Rehg, Minsu Kim, Yong Man Ro

Abstract

Recent progress in video large language models (Video-LLMs) has enabled strong offline reasoning over long and complex videos. However, real-world deployments increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only how to respond, but also when to respond. In this work, we revisit proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, we model activation signals jointly over a sliding temporal window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which employs a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE yields more reliable and temporally coherent proactive responses, significantly improving when-to-speak decision quality in online streaming scenarios.
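
To make the windowed denoising idea concrete, here is a minimal Python sketch of one refinement pass over a sliding activation window. It is an illustrative reconstruction from the abstract, not the released implementation: the `denoiser` call signature, the state convention {0: silent, 1: active, 2: masked}, and the step count are all assumptions.

```python
import torch

MASK = 2  # sentinel for an undecided activation state (assumed convention)

@torch.no_grad()
def denoise_window(denoiser, frame_feats, states, num_steps=4):
    """One masked-diffusion refinement pass over a window of W frames.

    denoiser(frame_feats, states) is assumed to return per-position
    logits over {0: silent, 1: active}. Each step commits only the most
    confident masked positions, so the activation span is filled in
    progressively rather than predicted frame-by-frame.
    """
    states = states.clone()
    for step in range(num_steps):
        masked = states == MASK
        if not masked.any():
            break  # every position in the window is already decided
        probs = denoiser(frame_feats, states).softmax(dim=-1)  # (W, 2)
        conf, pred = probs.max(dim=-1)                         # (W,), (W,)
        conf = conf.masked_fill(~masked, float("-inf"))        # keep decided slots
        # Unmask a roughly even share of the remaining positions per step.
        n_commit = max(1, int(masked.sum()) // (num_steps - step))
        idx = conf.topk(n_commit).indices
        states[idx] = pred[idx]
    return states
```

Decoded this way, positions committed early in the pass condition the predictions for later, harder positions, which is what allows span-structured rather than frame-independent activation decisions.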

Figures (19)

  • Figure 1: Overview of STRIDE, which operates in a streaming setting where frames arrive online. A lightweight activation model based on masked diffusion maintains an activation region over a sliding temporal window and iteratively denoises masked activation states to predict a coherent trigger segment. A trigger is issued only if an active span is sustained for a predefined span ratio. When activation is triggered, the accumulated frame context is forwarded to a downstream Video-LLM to generate the response.
  • Figure 2: Activation modeling and inference stage of STRIDE. Training applies sequence duplication and three masking strategies (boundary-anchored masking, span unmasking, full masking). During inference, the activation window slides with incoming frames, retaining confident past decisions while selectively re-masking and progressively denoising uncertain positions (a hedged sketch of this window update, together with the Figure 1 trigger rule, follows the figure list).
  • Figure 3: Activation transition frequency around event boundaries on ET-Bench TVG. The Baseline-AR model exhibits frequent oscillations near boundaries, whereas STRIDE produces more robust activation spans.
  • Figure 4: Trade-off between ET-Bench performance (mean F1) and inference latency as a function of the number of denoising steps $K$.
  • Figure 5: Sensitivity of STRIDE to the retention constant $\tau$ across five temporal understanding tasks (TVG, EPM, TAL, DVC, SLC) in ET-Bench [liu2024bench]. The $y$-axis shows the score difference relative to each task's average. Task-wise average scores are shown in the legend.
  • ...and 14 more figures
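
Reading the Figure 1 and Figure 2 captions together, the inference-time control flow can be sketched as below. This is a hedged reconstruction, not the authors' code: the re-masking rule against the retention constant $\tau$, the reading of "sustained for a predefined span ratio" as a trailing-run test, and the defaults `tau=0.8` and `span_ratio=0.5` are all assumptions.

```python
import torch

MASK = 2  # undecided-state sentinel, shared with the sketch above

def slide_and_remask(states, conf, tau=0.8):
    """Advance the activation window by one frame (Figure 2, inference).

    `conf` holds the denoiser's per-position confidence from the last
    refinement pass. Confident past decisions are retained; positions
    whose confidence falls below the retention constant tau are
    re-masked for further denoising (tau = 0.8 is an assumed value).
    """
    states, conf = torch.roll(states, -1), torch.roll(conf, -1)
    states[-1], conf[-1] = MASK, 0.0   # the newly arrived frame is unknown
    states[conf < tau] = MASK          # selectively re-mask uncertain slots
    return states, conf

def should_trigger(states, span_ratio=0.5):
    """Issue a trigger only for a sustained active span (Figure 1).

    'Sustained' is read here as: the trailing run of active (== 1)
    states covers at least `span_ratio` of the window; this reading
    and the default ratio are assumptions.
    """
    run = 0
    for s in reversed(states.tolist()):
        if s != 1:
            break
        run += 1
    return run / len(states) >= span_ratio
```

When `should_trigger` fires, the accumulated frame context would be forwarded to the downstream Video-LLM to generate the response, matching the hand-off described in the Figure 1 caption.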