Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?

Manuel Benavent-Lledo, Konstantinos Bacharidis, Victoria Manousaki, Konstantinos Papoutsakis, Antonis Argyros, Jose Garcia-Rodriguez

TL;DR

The paper demonstrates that a single RGB frame augmented with depth and carefully designed action history priors can rival temporally aggregated video methods for action anticipation. By employing DINOv2 features, Depth Anything V2, cross-attention fusion, and multiple action-history strategies (including per-action textual embeddings), AAG achieves competitive results on IKEA-ASM, Meccano, and Assembly101. The work provides a nuanced view of when short-term multimodal cues suffice versus when long-range temporal modeling is essential, and highlights the importance of high-quality action history generation. These findings suggest that efficient, single-frame approaches can be viable alternatives in structured industrial contexts, with potential for further gains via dataset-specific cues and memory mechanisms for past actions.

Abstract

Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.

Paper Structure

This paper contains 21 sections, 4 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: How much information is necessary to anticipate the next action accurately? Is a single-frame, short-term observation sufficient, or does long-term context from past actions improve prediction? We demonstrate that fusing a single frame with appropriate contextual modalities enables effective anticipation without relying on video aggregation. The frame is extracted from the IKEA-ASM dataset (Ben-Shabat et al., WACV 2021), and the next action is "attach shelf to table".
  • Figure 2: AAG architecture with proposed action history strategies. Given a frame $x_T$, captured $\delta$ seconds before an action, the visual fusion module combines RGB and depth embeddings from a frozen extractor (depth mapped to RGB for encoding). Visual features are fused with past actions via a self-attention transformer. We test three history encoding methods: (1) prompting a vision-language model (blue), (2) generating past action descriptions (red), and (3) separately encoding action classes before fusion (yellow), which performs best.
  • Figure A: Comparison of original depth frames and Depth Anything V2 (DAv2) estimates on the IKEA-ASM dataset. The higher noise in the original depth frames causes reduced performance when fused with RGB features.
  • Figure B: Qualitative comparison of depth frames from DAv2 against the original RGB images across different benchmark datasets. The third-person perspective in IKEA-ASM clearly distinguishes background and foreground elements, enhancing depth's utility. In contrast, the closer camera views in Meccano and Assembly101, where most objects and interactions occur within a nearly planar workspace, limit the effectiveness of depth.
  • Figure C: Qualitative comparison of VLMs and VQA methods on three different prompts. The frame is extracted from IKEA-ASM.
  • ...and 1 more figure
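
The pipeline described in the Figure 2 caption (RGB and depth embeddings fused, then combined with per-action history embeddings through a self-attention transformer) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The random arrays stand in for frozen-extractor features, and the attention direction (RGB tokens querying depth tokens), the residual connection, the mean pooling, the linear head `w_cls`, and the absence of learned projections are all assumptions made to keep the example short. The function name `aag_sketch` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention without learned projections (simplification).
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values, weights

def aag_sketch(rgb_tokens, depth_tokens, history_tokens, w_cls):
    # 1) Visual fusion: RGB tokens attend to depth tokens (assumed direction),
    #    with a residual connection back to the RGB tokens (assumption).
    fused, _ = attention(rgb_tokens, depth_tokens, depth_tokens)
    visual = rgb_tokens + fused
    # 2) History fusion: visual tokens and one embedding per past action
    #    are concatenated and mixed by a single self-attention step.
    seq = np.concatenate([visual, history_tokens], axis=0)
    mixed, weights = attention(seq, seq, seq)
    # 3) Mean-pool the sequence and score the candidate next actions
    #    with a linear head (both assumptions).
    pooled = mixed.mean(axis=0)
    return softmax(pooled @ w_cls), weights

# Toy sizes, chosen only for illustration.
D, N_PATCH, N_HIST, N_CLASSES = 16, 4, 3, 5
rng = np.random.default_rng(0)
rgb = rng.normal(size=(N_PATCH, D))      # stand-in for RGB features of frame x_T
depth = rng.normal(size=(N_PATCH, D))    # stand-in for features of the depth map
history = rng.normal(size=(N_HIST, D))   # stand-in for per-action text embeddings
w_cls = rng.normal(size=(D, N_CLASSES))  # hypothetical classification head

probs, attn = aag_sketch(rgb, depth, history, w_cls)
print(probs.shape, attn.shape)
```

Encoding each past action as its own token, as in step 2, lets the attention weights show how much each past action contributes to the prediction; with random weights the values themselves carry no meaning.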