Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation

Qiaohui Chu, Haoyu Zhang, Meng Liu, Yisen Feng, Haoxiang Shi, Liqiang Nie

TL;DR

The paper tackles long-term action anticipation in egocentric video by introducing INSIGHT, a two-stage framework that first emphasizes hand-object interaction (HOI) cues and verb-noun semantics, then applies an explicit cognitive reasoning module trained with a structured reward strategy. HOI-augmented feature extraction and verb-noun co-occurrence correction provide rich, coherent action representations, which feed a reinforcement learning-based think–reason–answer policy that forecasts future actions. In experiments on Ego4D, EPIC-Kitchens-55 (EK-55), and EGTEA Gaze+, INSIGHT achieves state-of-the-art performance, with notable improvements on rare action classes and strong generalization across benchmarks. The work demonstrates the value of integrating fine-grained visual grounding with explicit, task-aware cognitive reasoning for robust long-horizon anticipation in real-world egocentric settings.

Abstract

Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) -> intention inference (reason) -> action anticipation (answer). Extensive experiments on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.
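The verb-noun co-occurrence idea from the first stage can be illustrated with a small sketch. This is not the authors' implementation: the function names (`build_cooccurrence`, `rescore`), the additive smoothing, and the toy probabilities are all assumptions made for illustration. The sketch estimates P(noun | verb) from observed pairs and uses it to rescore independently predicted verb and noun distributions, so that implausible pairings are down-weighted.

```python
from typing import Dict, List, Tuple


def build_cooccurrence(
    pairs: List[Tuple[str, str]],
    verbs: List[str],
    nouns: List[str],
    smoothing: float = 0.1,
) -> Dict[str, Dict[str, float]]:
    """Estimate P(noun | verb) from (verb, noun) pairs with additive smoothing.

    Hypothetical helper: the paper's actual matrix construction may differ.
    """
    counts = {v: {n: smoothing for n in nouns} for v in verbs}
    for verb, noun in pairs:
        counts[verb][noun] += 1.0
    cooc = {}
    for verb, row in counts.items():
        total = sum(row.values())
        cooc[verb] = {noun: c / total for noun, c in row.items()}
    return cooc


def rescore(
    verb_probs: Dict[str, float],
    noun_probs: Dict[str, float],
    cooc: Dict[str, Dict[str, float]],
) -> Tuple[str, str]:
    """Return the (verb, noun) pair maximizing p(verb) * p(noun) * P(noun | verb)."""
    best_pair, best_score = None, float("-inf")
    for verb, pv in verb_probs.items():
        for noun, pn in noun_probs.items():
            score = pv * pn * cooc[verb][noun]
            if score > best_score:
                best_pair, best_score = (verb, noun), score
    return best_pair


# Toy training annotations (assumed data, not from any benchmark).
train_pairs = [("cut", "onion")] * 3 + [("wash", "plate")] * 2 + [("cut", "tomato")]
cooc = build_cooccurrence(train_pairs, ["cut", "wash"], ["onion", "plate", "tomato"])

# Independent classifier outputs (assumed values).
verb_probs = {"cut": 0.6, "wash": 0.4}
noun_probs = {"onion": 0.30, "plate": 0.45, "tomato": 0.25}

# Taking each argmax independently yields an implausible pair.
independent = (max(verb_probs, key=verb_probs.get), max(noun_probs, key=noun_probs.get))
corrected = rescore(verb_probs, noun_probs, cooc)
print(independent, corrected)
```

In this toy case the independent argmax pairs "cut" with "plate", a combination never seen in the training pairs, while the rescored prediction switches to "wash plate". Whether to condition nouns on verbs, verbs on nouns, or use a symmetric joint prior is a design choice the abstract does not specify.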

Paper Structure

This paper contains 33 sections, 19 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustration of the explicit cognitive reasoning process for long-term action anticipation.
  • Figure 2: Overview of our two-stage framework, INSIGHT. Stage 1: Hand-Object Semantic Action Recognition leverages an HOI-augmented feature extraction module to focus on critical hand-object regions, alongside a semantic co-occurrence transition matrix that captures verb-noun relationships. The resulting enriched visual features are formatted as prompts for a VLM. Stage 2: Explicit Cognitive Reasoning for Anticipation introduces a reinforcement learning-based intention inference mechanism that simulates a structured three-step cognitive process, i.e., visual perception (think), intention inference (reason), and action anticipation (answer), to enable dynamic reasoning and generate accurate long-term predictions.
  • Figure 3: Illustration of the structured prompt used during GRPO training.
  • Figure 4: Case study on Ego4D for the LTA task. INSIGHT demonstrates intention-aware predictions with improved temporal coherence, fewer redundant actions, and more precise verb-noun pairings, leading to lower edit distance.
  • Figure 5: Training visualization of the GRPO-based cognitive reasoning process.
  • ...and 1 more figure
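The captions above mention a structured think, reason, answer prompt and GRPO training. A minimal sketch of how a structured reward and GRPO-style group-relative advantages might be computed follows. The tag names (`<think>`, `<reason>`, `<answer>`), the comma-separated answer format, and the specific reward definitions are assumptions for illustration, not the paper's exact design. Plain Levenshtein distance stands in for the benchmark's edit-distance metric. The one standard ingredient is the group normalization step, where each sampled response's reward is centered and scaled by the statistics of its own group.

```python
import re
from statistics import mean, pstdev
from typing import List

# Assumed output layout: three tagged sections in a fixed order.
PATTERN = re.compile(
    r"^\s*<think>(.+?)</think>\s*<reason>(.+?)</reason>\s*<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)


def format_reward(text: str) -> float:
    """1.0 if the response follows the think -> reason -> answer layout, else 0.0."""
    return 1.0 if PATTERN.match(text) else 0.0


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two action sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def accuracy_reward(text: str, target: List[str]) -> float:
    """1 - normalized edit distance between the predicted and target actions."""
    match = PATTERN.match(text)
    if not match:
        return 0.0
    pred = [s.strip() for s in match.group(3).split(",") if s.strip()]
    denom = max(len(pred), len(target), 1)
    return 1.0 - edit_distance(pred, target) / denom


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: center and scale rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


target = ["cut onion", "wash plate"]
responses = [
    "<think>hands hold a knife</think><reason>preparing vegetables</reason>"
    "<answer>cut onion, wash plate</answer>",
    "<think>hands hold a knife</think><reason>cleaning up</reason>"
    "<answer>wash plate, wash plate</answer>",
    "cut onion, wash plate",  # right actions, but missing the required structure
]
rewards = [format_reward(r) + accuracy_reward(r, target) for r in responses]
advantages = group_advantages(rewards)
print(rewards, advantages)
```

Summing the format and accuracy terms with equal weight is itself an assumption. In practice the weighting, and any additional reward terms, would follow whatever the paper's structured reward strategy specifies.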