Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time

Weijie Zhou, Xuantang Xiong, Zhenlin Hu, Xiaomeng Zhu, Chaoyang Zhao, Honghui Dong, Zhengyou Zhang, Ming Tang, Jinqiao Wang

TL;DR

EcoG-Bench provides a strict, executable testbed for event-level speech–gesture binding, and suggests that multimodal interfaces may bottleneck the observability of temporal alignment cues, independently of model reasoning.

Abstract

In situated collaboration, speakers often use intentionally underspecified deictic commands (e.g., “pass me that”), whose referent becomes identifiable only by aligning speech with a brief co-speech pointing stroke. However, many embodied benchmarks admit language-only shortcuts, allowing MLLMs to perform well without learning the audio–visual alignment required by deictic interaction. To bridge this gap, we introduce Egocentric Co-Speech Grounding (EcoG), where grounding is executable only if an agent jointly predicts What, Where, and When. To operationalize this, we present EcoG-Bench, an evaluation-only bilingual (EN/ZH) diagnostic benchmark of 811 egocentric clips with dense spatial annotations and millisecond-level stroke supervision. It is organized under a Progressive Cognitive Evaluation protocol. Benchmarking state-of-the-art MLLMs reveals a severe executability gap: while human subjects achieve near-ceiling performance on EcoG-Bench (96.9% strict Eco-Accuracy), the best native video–audio setting remains low (Gemini-3-Pro: 17.0%). Moreover, in a diagnostic ablation, replacing the native video–audio interface with timestamped frame samples and externally verified ASR (with word-level timing) substantially improves the same model (17.0% → 42.9%). Overall, EcoG-Bench provides a strict, executable testbed for event-level speech–gesture binding, and suggests that multimodal interfaces may bottleneck the observability of temporal alignment cues, independently of model reasoning.

Paper Structure (18 sections, 3 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: From text-sufficient grounding to deictic co-speech event binding. Left: In many existing embodied/grounding benchmarks, the instruction is semantically exhaustive (e.g., attributes and spatial relations), so the correct referent can be inferred from text alone and co-speech gesture video is largely optional. Right: EcoG models natural deictic collaboration, where utterances are intentionally underspecified (e.g., “put this in it”) and become solvable only by aligning each deictic phrase to a brief co-speech pointing stroke on the video timeline. Successful EcoG grounding requires within-clip event assignment: binding each phrase to the correct stroke, then producing an executable intent for every step (What target, Where actionable 2D point, and When stroke time).
  • Figure 2: EcoG task overview. Given an egocentric video clip with synchronized audio, the model must ground each deictic referent in the instruction by outputting an ordered list of triplets: What (an index in a clip-specific closed-set of candidate options), Where (a 2D point on the last frame, ensuring an actionable “landing point”), and When (an integer timestamp in milliseconds from clip start that must fall inside the annotated gesture-stroke window that disambiguates the referent).
  • Figure 3: Progressive Cognitive Evaluation protocol and dataset composition. EcoG-Bench organizes 811 egocentric clips (EN/ZH) into four levels with increasing compositionality and event-assignment difficulty: L1 silent deictic pointing (K=1), L2 single-event co-speech binding (K=1), L3 dual-event deictic assignment (K=2), and L4 multi-event intent chaining (K=3–4). The figure illustrates the corresponding instruction templates and the increasing requirement to assign each deictic phrase to the correct within-clip gesture stroke.
  • Figure 4: Qualitative results on EcoG-Bench. Examples of model predictions versus ground truth under strict What/Where/When evaluation. The shown cases highlight typical failure modes: correct recognition but inaccurate pointing on small/occluded objects, and mis-binding a deictic phrase to a nearby (but incorrect) stroke event—both of which render the output non-executable under conjunctive Eco-Accuracy.
  • Figure 5: Failure bottleneck analysis of EcoG. Breakdown of errors by which components of the executable grounding triplet fail (What, Where, When) and their combinations. Joint failures (e.g., Where+When) constitute a large portion of errors, indicating that EcoG difficulty is dominated by cross-modal event binding rather than isolated object classification alone.
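
The triplet output described in Figure 2 and the error breakdown described in Figure 5 can be illustrated with a small scoring sketch. This is not the benchmark's released evaluation code. The class and function names are hypothetical, the Where test is assumed here to be a point-in-bounding-box check (the captions do not state the exact spatial criterion), and a clip is assumed to count as executable only when every step passes all three checks.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

COMPONENT_ORDER = ["What", "Where", "When"]


@dataclass
class Prediction:
    what: int                    # index into the clip's closed set of candidate options
    where: Tuple[float, float]   # (x, y) point on the last frame, in pixels
    when_ms: int                 # integer timestamp in ms from clip start


@dataclass
class GroundTruth:
    what: int
    # ASSUMPTION: the target region is an axis-aligned box (x_min, y_min, x_max, y_max).
    box: Tuple[float, float, float, float]
    stroke_ms: Tuple[int, int]   # annotated gesture-stroke window (start, end), inclusive


def failed_components(pred: Prediction, gt: GroundTruth) -> FrozenSet[str]:
    """Return which of What / Where / When are wrong for one step."""
    failed = set()
    if pred.what != gt.what:
        failed.add("What")
    x, y = pred.where
    x0, y0, x1, y1 = gt.box
    if not (x0 <= x <= x1 and y0 <= y <= y1):
        failed.add("Where")
    start, end = gt.stroke_ms
    if not (start <= pred.when_ms <= end):
        failed.add("When")
    return frozenset(failed)


def clip_is_executable(preds: List[Prediction], gts: List[GroundTruth]) -> bool:
    """Conjunctive check: the ordered step lists must align and every step must pass."""
    if len(preds) != len(gts):
        return False
    return all(not failed_components(p, g) for p, g in zip(preds, gts))


def failure_breakdown(
    clips: List[Tuple[List[Prediction], List[GroundTruth]]]
) -> Counter:
    """Count failing steps by the combination of components that failed."""
    counts: Counter = Counter()
    for preds, gts in clips:
        for p, g in zip(preds, gts):
            failed = failed_components(p, g)
            if failed:
                label = "+".join(sorted(failed, key=COMPONENT_ORDER.index))
                counts[label] += 1
    return counts
```

Under these assumptions, a step with the right object and a correct point but a timestamp outside the stroke window is tallied as a `When` failure, and it makes the whole clip non-executable even if every other step is correct.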