Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Yikai Zheng; Xin Ding; Yifan Yang; Shiqi Jiang; Hao Wu; Qianxi Zhang; Weijun Wang; Ting Cao; Yunxin Liu

Abstract

Recent advances in Streaming Video Understanding have enabled a new interaction paradigm in which models respond proactively to user queries. Current proactive VideoLLMs make a triggering decision at every frame, which creates an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.

Paper Structure (40 sections, 13 equations, 12 figures, 7 tables)

This paper contains 40 sections, 13 equations, 12 figures, 7 tables.

Introduction
Related Work
Methodology
Problem Formulation
The Overview of Em-Garde
Instruction-Guided Proposal Parser
Properties of the proposals
Learning effective proposals
Lightweight Proposal Matching Module
Computational Efficiency
Experiments
Experiment Setup
Implementation Details
Benchmarks for Evaluation
Results on Proactive Response tasks
...and 25 more sections

Figures (12)

Figure 1: Demonstration of our model vs. existing Streaming VideoLLMs on the Proactive Streaming Understanding task. While existing models solve a complicated response/silence decision-making problem at every timestep, we turn the problem into a simple perception problem with query-time semantic parsing, allowing for efficient and accurate proactive response timing.
Figure 2: Overview of the Em-Garde Framework: IGPP (Orange) receives the Instruction $I$ and a low-fps video context before query time, and parses the instruction into perceptually grounded visual cues. LPMM (Blue) runs in the streaming loop, matching the current sliding-window video segment to the proposal in the embedding space. The similarity scores are used as the temporal signal for the response triggering decision. Together, the two modules separate semantic understanding from the streaming loop, allowing for efficient and accurate response timing decisions.
Figure 3: Throughput-Performance comparison on the Proactive Streaming Understanding task. Performance is measured by the average F1 Score on OVO-Bench. For models marked with *, throughput degrades as the context grows unless the KV Cache is truncated; we report their throughput before degradation.
Figure 4: Comparison between proposal content (bottom left) and triggering behavior (bottom right) after SFT and RL. The proposals given by the IGPP before and after RL are shown on the bottom left. The rise in the LPMM similarity score and the response times are shown on the bottom right. Because the IGPP behaves differently in the two cases, the LPMM finds a match at the target timestamp for the proposals after RL, but matches at a wrong timestamp (118s) for the proposals before RL.
Figure 5: Recall-Precision plot for varied detection thresholds across OVO-Bench Forward Active Response Tasks.
...and 7 more figures
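
The matching step described in the abstract and in the Figure 2 caption (comparing each sliding-window segment with a proposal in embedding space and using the similarity score as the trigger signal) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of cosine similarity, the fixed threshold, the toy two-dimensional vectors, and the names `cosine_similarity` and `first_trigger` are all assumptions made for illustration, since this page does not specify the encoder, the similarity function, or the triggering rule.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def first_trigger(segment_embeddings, proposal_embedding, threshold):
    """Return (index, score) for the first segment whose similarity to the
    proposal reaches the threshold, or None if no segment does.

    Hypothetical trigger rule: a single fixed threshold on the score.
    """
    for index, segment in enumerate(segment_embeddings):
        score = cosine_similarity(segment, proposal_embedding)
        if score >= threshold:
            return index, score
    return None


# Toy data: one proposal embedding and three consecutive segment embeddings.
proposal = [1.0, 0.0]
stream = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]

result = first_trigger(stream, proposal, threshold=0.9)
print(result)
```

With these toy values the first two segments score below 0.9 and the third exceeds it, so the trigger fires at index 2. Raising or lowering `threshold` trades recall against precision, which is the kind of sweep the Figure 5 caption refers to.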
