Streaming Detection of Queried Event Start

Cristobal Eyzaguirre, Eric Tang, Shyamal Buch, Adrien Gaidon, Jiajun Wu, Juan Carlos Niebles

TL;DR

This paper introduces Streaming Detection of Queried Event Start (SDQES), a task for real-time multimodal video understanding with open-vocabulary natural language queries in egocentric video. The authors formalize the streaming task, propose EgoSDQES as a benchmark, and develop metrics (Streaming Recall and Streaming Minimum Distance) to capture latency-accuracy trade-offs. They adapt vision-language foundation models with streaming adapters (including ST-, QR-, and RN-Adapters) and demonstrate that temporal adapters yield strong performance with modest computational overhead across multiple backbones. The work targets low-latency, flexible event-start detection for embodied applications such as robotics and AR, while acknowledging dataset biases and calling for future improvements in data quality and scalability.

Abstract

Robotics, autonomous driving, augmented reality, and many embodied computer vision applications must quickly react to user-defined events unfolding in real time. We address this setting by proposing a novel task for multimodal video understanding: Streaming Detection of Queried Event Start (SDQES). The goal of SDQES is to identify the beginning of a complex event as described by a natural language query, with high accuracy and low latency. We introduce a new benchmark based on the Ego4D dataset, as well as new task-specific metrics to study streaming multimodal detection of diverse events in an egocentric video setting. Inspired by parameter-efficient fine-tuning methods in NLP and for video tasks, we propose adapter-based baselines that enable image-to-video transfer learning, allowing for efficient online video modeling. We evaluate three vision-language backbones and three adapter architectures on both short-clip and untrimmed video settings.
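
To make the streaming constraint concrete, the following is a minimal sketch of a causal detector that reads per-frame query-match scores in arrival order and fires at the first frame that reaches a threshold. The function names (`detect_start`, `signed_delay`), the thresholding rule, and the example scores are assumptions made for illustration; they are not the paper's models, and the delay computed here is not the paper's Streaming Recall or Streaming Minimum Distance metric.

```python
from typing import Optional, Sequence


def detect_start(scores: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first frame whose score reaches the threshold.

    Frames are visited strictly in order, so the decision at frame t
    depends only on frames 0..t (no look-ahead into the future).
    Hypothetical helper, not an API from the paper.
    """
    for t, score in enumerate(scores):
        if score >= threshold:
            return t
    return None  # the detector never fired


def signed_delay(predicted: Optional[int], true_start: int, fps: float) -> Optional[float]:
    """Seconds between the prediction and the annotated start.

    Positive values mean the detector fired late, negative values mean
    it fired early. Returns None when there was no prediction.
    """
    if predicted is None:
        return None
    return (predicted - true_start) / fps


# Made-up per-frame scores for one query, sampled at 2 frames per second.
scores = [0.1, 0.2, 0.15, 0.6, 0.9, 0.8]
pred = detect_start(scores, threshold=0.5)
delay = signed_delay(pred, true_start=2, fps=2.0)
print(pred, delay)
```

Lowering the threshold in this sketch makes the detector fire earlier but raises the risk of false alarms, which is the kind of latency-accuracy trade-off the task is designed to expose.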

Paper Structure

This paper contains 43 sections, 13 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of our proposed SDQES task. The goal of streaming detection of queried event start (SDQES) is for a system to detect the start of a complex event, described by natural language, with low latency from a streaming video input. This task is a novel intersection of multimodal event and online/streaming video understanding benchmarks. It is intended to encourage the design of new streaming multimodal models for challenging egocentric or embodied settings (e.g., assistive robotics, augmented reality) where time-sensitivity is a key concern for safety, accessibility, or convenience.
  • Figure 2: Example videos and queries from our dataset EgoSDQES.
  • Figure 3: Dataset generation pipeline. Left: we show the generation pipeline steps for an example video with dense captions. Right: Sankey diagram illustrates the flow of data from Ego4D through the various filtering stages. Asterisk ($*$) encodes a filter based on query specificity.
  • Figure 4: Dataset statistics. Left: Event duration in seconds. Center: Distribution of Event Start with respect to video start. Right: Word Cloud of the query generations.
  • Figure 5: Overview of the Streaming-Adapter. (a) Intervened Block: the lock icon denotes frozen parameters; only adapter parameters are trained. Temporal adapters operate on a reduced dimension for efficiency. (b) Adapter Internals: the adapter operates over the temporal dimension and consists of temporal aggregation layers. The final state of the model is stored for when the next frame arrives.
  • ...and 2 more figures
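
The Figure 5 caption describes an adapter that works in a reduced dimension, aggregates over time, and keeps its final state for the next frame. A minimal sketch of that pattern follows, assuming a down-projection, an exponential moving average as the temporal aggregation, an up-projection, and a residual connection. The class name `StreamingAdapterSketch`, the moving-average rule, and the `decay` parameter are hypothetical stand-ins chosen for illustration; they are not the paper's ST-, QR-, or RN-Adapter.

```python
from typing import List


class StreamingAdapterSketch:
    """Toy bottleneck adapter with a stored temporal state (illustrative only)."""

    def __init__(self, down: List[List[float]], up: List[List[float]], decay: float = 0.9):
        self.down = down                 # shape (bottleneck, dim): the only trainable part
        self.up = up                     # shape (dim, bottleneck): also trainable
        self.decay = decay               # assumed aggregation hyperparameter
        self.state = [0.0] * len(down)   # carried over between frames

    @staticmethod
    def _matvec(matrix: List[List[float]], vec: List[float]) -> List[float]:
        return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

    def step(self, feat: List[float]) -> List[float]:
        """Process one frame's feature vector from the frozen backbone."""
        z = self._matvec(self.down, feat)              # project to reduced dimension
        # Temporal aggregation: blend the stored state with the new frame.
        self.state = [self.decay * s + (1.0 - self.decay) * zi
                      for s, zi in zip(self.state, z)]
        delta = self._matvec(self.up, self.state)      # project back to full dimension
        return [f + d for f, d in zip(feat, delta)]    # residual connection


adapter = StreamingAdapterSketch(down=[[1.0, 0.0]], up=[[0.0], [1.0]], decay=0.5)
first = adapter.step([2.0, 0.0])
second = adapter.step([0.0, 0.0])
print(first, second)
```

Because only `self.state` persists between calls, the per-frame cost in this sketch stays constant regardless of how long the video has been streaming, which is the property that makes state-carrying adapters attractive for online use.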