SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion

Xiang Li, Heqian Qiu, Lanxiao Wang, Benliu Qiu, Fanman Meng, Linfeng Xu, Hongliang Li

Abstract

Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego$\rightarrow$Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components. Code is available at https://github.com/jack1ee/SAVAX.
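To make the localization metric concrete, the following is a minimal sketch of temporal IoU (tIoU) between predicted and ground-truth step spans on the ego timeline. The function names `temporal_iou` and `mean_tiou` are hypothetical, and the matching rule used here (the best-overlapping prediction for each ground-truth span) is an assumption for illustration; the abstract does not state the exact evaluation protocol.

```python
def temporal_iou(pred, gt):
    """tIoU between two (start, end) spans given in the same time unit."""
    (ps, pe), (gs, ge) = pred, gt
    # Overlap length; clamped at zero when the spans are disjoint.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mean_tiou(preds, gts):
    """Average over ground-truth spans of the best tIoU with any prediction.

    Assumption: best-match averaging; the paper's protocol may differ.
    """
    if not gts:
        return 0.0
    best = [max((temporal_iou(p, g) for p in preds), default=0.0) for g in gts]
    return sum(best) / len(best)
```

For example, a prediction of (0, 10) against a ground-truth span of (5, 15) overlaps for 5 units out of a union of 15, giving a tIoU of 1/3.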

Paper Structure (26 sections, 18 equations, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: Top: Schematic of the Ego→Exo imitation-error detection task. The system localizes steps on the ego timeline and judges each step by semantic adherence to the exocentric demonstration, rather than by rigid speed/pose matching. Bottom-left: The baseline exhibits a counterintuitive performance drop as the number of input frames increases, partly because redundant frames in the videos introduce distraction. Bottom-middle: There is a pronounced domain shift between Ego and Exo: the distribution of similarities between video-level features of demonstration–imitation pairs is overly dispersed. Bottom-right: A key challenge is how to effectively fuse information from Ego and Exo videos to accomplish the task.
  • Figure 2: Overview of SAVA-X. (1) A frozen video encoder extracts per-frame features from the exocentric demonstration and egocentric imitation streams. We apply gated adaptive sampling (Sec. \ref{sec:sampling}), i.e., hard Top-K with residual gating, using self-attention scoring for Exo and Exo-conditioned cross-attention scoring for Ego to select key segments. (2) We inject scene-aware dictionary view embeddings (Sec. \ref{sec:view-embed}) together with temporal positions (multi-level), regularized by attention-entropy and prototype-diversity terms, to mitigate cross-view domain shifts. (3) We perform bidirectional cross-attention fusion (Sec. \ref{sec:bixattn}) with learnable gating to align and aggregate complementary cues, yielding a fused sequence that a deformable Transformer encoder–decoder converts into first-person temporal spans and imitation-correctness predictions. Training uses Hungarian set prediction with $\mathcal{L}_{\text{DVC}}$ and $\mathcal{L}_{\text{Imit}}$.
  • Figure 3: Qualitative examples of Ego→Exo imitation error localization. (a): Exocentric demonstration and egocentric imitation with corresponding frame saliency maps. Deeper red indicates higher saliency. (b): Ground truth (GT) and baseline vs. SAVA-X. Red marks erroneous steps and green marks correct steps.
  • Figure 4: Performance under different adaptive sampling (AS) k-ratios at 1 fps and 5 fps (dashed = w/o AS).
  • Figure 5: Relative gain vs. dictionary size for scene-aware view embeddings on the EgoMe validation split. Dashed lines indicate baselines: gray is without view embeddings, and black is with fixed learnable view embeddings.
  • ...and 2 more figures
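
The gated adaptive sampling step named in the Figure 2 caption (hard Top-K with residual gating) can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the function name `gated_topk_sample` is hypothetical, the per-frame saliency scores are taken as given inputs (the caption says they come from self-attention for Exo and Exo-conditioned cross-attention for Ego), and the residual gate is assumed to take the form x + sigmoid(score) * x.

```python
import math


def gated_topk_sample(features, scores, k_ratio):
    """Keep the top-scoring fraction of frames, in temporal order.

    features: list of per-frame feature vectors (lists of floats)
    scores:   one saliency score per frame (assumed precomputed)
    k_ratio:  fraction of frames to keep, in (0, 1]
    """
    n = len(features)
    k = max(1, min(n, math.ceil(k_ratio * n)))
    # Hard Top-K: pick the k highest-scoring frames, then restore time order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    sampled = []
    for i in kept:
        gate = 1.0 / (1.0 + math.exp(-scores[i]))  # sigmoid gate in (0, 1)
        # Residual gating (assumed form): original feature plus its gated copy.
        sampled.append([x + gate * x for x in features[i]])
    return kept, sampled
```

With four frames and `k_ratio=0.5`, the two highest-scoring frames are kept and returned in their original temporal order, each scaled by one plus its gate value. In a trained model the residual path would let gradients reach the scorer even though the Top-K selection itself is non-differentiable; this sketch only shows the forward computation.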