Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong, Tianshuai Hu, Lai Xing Ng, Wei Tsang Ooi, Benoit R. Cottereau

TL;DR

This work introduces Talk2Event, the first large-scale benchmark for grounding objects in dynamic event-camera data using natural language. It formalizes visual grounding from asynchronous event streams and provides rich attribute annotations (Appearance, Status, Relation-to-Viewer, Relation-to-Others) to capture spatiotemporal cues, along with a dataset built atop real driving sequences. To tackle the grounding task, the authors propose EventRefer, an attribute-aware framework that employs a Mixture of Event-Attribute Experts (MoEE) to adaptively fuse appearance, motion, and relational cues, enabling robust performance in event-only, frame-only, and event-frame fusion settings. Empirical results show that EventRefer outperforms strong baselines across all modalities and object types, demonstrating superior localization accuracy and interpretability in dynamic scenes, which has direct implications for language-informed perception in autonomous systems.

Abstract

Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes -- appearance, status, relation to viewer, and relation to other objects -- bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.

Paper Structure

This paper contains 56 sections, 9 equations, 17 figures, 16 tables.

Figures (17)

  • Figure 1: Grounded scene understanding from event streams. This work presents Talk2Event, a novel task for localizing objects from event cameras using natural language, where each unique object in the scene is defined by four key attributes: ①Appearance, ②Status, ③Relation-to-Viewer, and ④Relation-to-Others. We find that modeling these attributes enables precise, interpretable, and temporally-aware grounding across diverse dynamic environments in the real world.
  • Figure 2: Pipeline of dataset curation. We leverage two surrounding frames at $t_0 \pm \Delta t$ to generate context-aware referring expressions at $t_0$, covering appearance, motion, spatial relations, and interactions. Word clouds on the right highlight distinct linguistic patterns across the four grounding attributes.
  • Figure 3: Overview of architecture. Given event stream $\mathbf{E}$, frame $\mathbf{F}$ (optional), and the corresponding referring expression $\mathcal{S}$, we aim to ground the target (object #2 in this example) from the scene using multi-attribute fusion. We first match each attribute’s cue phrase into a token-level map (Sec. \ref{sec:positive_word_matching}). The Mixture of Event-Attribute Experts masks, refines, and fuses event–text features to produce the fused representation (Sec. \ref{sec:moe}). Multi-Attribute Fusion treats the four attributes as co-located pseudo-targets and, at inference, combines their scores to select the final bounding box (Sec. \ref{sec:maf}).
  • Figure 4: Qualitative assessment of grounding approaches on Talk2Event. The ground-truth and predicted boxes are shown in green and blue, respectively. See Appendix for more examples.
  • Figure 5: Class-wise attribute expert activations. We visualize the proportion of activations for each attribute expert in MoEE under two grounding settings. The top-$1$ proportion for each class is highlighted.
  • ...and 12 more figures
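
The Figure 3 caption describes two steps: fusing per-attribute features through a mixture of experts, then combining per-attribute scores at inference to pick a box. The sketch below illustrates that general idea only and is not the authors' implementation. It assumes a softmax gate over the four attributes and a plain sum of per-attribute scores; the gating form, the summation rule, and every name (`fuse_attribute_features`, `select_box`, `gate_logits`) are hypothetical, since the captions do not specify them.

```python
import math

# The four grounding attributes named in the paper.
ATTRIBUTES = ("appearance", "status", "relation_to_viewer", "relation_to_others")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_attribute_features(expert_features, gate_logits):
    """Softmax-gated weighted sum of per-attribute feature vectors.

    ASSUMPTION: a softmax gate is a common mixture-of-experts choice;
    the source does not state how MoEE weights its experts.
    """
    weights = softmax([gate_logits[a] for a in ATTRIBUTES])
    dim = len(expert_features[ATTRIBUTES[0]])
    fused = [0.0] * dim
    for weight, attr in zip(weights, ATTRIBUTES):
        for i, value in enumerate(expert_features[attr]):
            fused[i] += weight * value
    return fused, dict(zip(ATTRIBUTES, weights))


def select_box(candidate_scores):
    """Return the index of the candidate box with the highest combined score.

    ASSUMPTION: scores are combined by summation; the caption only says
    the attribute scores are "combined" at inference.
    """
    totals = [sum(scores[a] for a in ATTRIBUTES) for scores in candidate_scores]
    return max(range(len(totals)), key=totals.__getitem__)


if __name__ == "__main__":
    features = {
        "appearance": [1.0, 0.0],
        "status": [0.0, 1.0],
        "relation_to_viewer": [1.0, 1.0],
        "relation_to_others": [0.0, 0.0],
    }
    # Equal gate logits give each expert the same weight.
    gates = {attr: 0.0 for attr in ATTRIBUTES}
    fused, weights = fuse_attribute_features(features, gates)
    print(fused, weights)

    candidates = [
        {attr: 0.2 for attr in ATTRIBUTES},
        {attr: 0.5 for attr in ATTRIBUTES},
    ]
    print(select_box(candidates))
```

In this toy setup the gate weights double as a rough interpretability signal: inspecting them per object class is similar in spirit to the expert-activation proportions that Figure 5 visualizes.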