EgoPrompt: Prompt Learning for Egocentric Action Recognition

Huaihai Lyu, Chaofan Chen, Yuheng Ji, Changsheng Xu

TL;DR

EgoPrompt tackles egocentric action recognition (EAR) by modeling the semantic interplay between verbs and nouns through a Unified Prompt Pool trained under a Diverse Pool Criteria objective. It introduces Implicit Pattern Interaction Learning to fuse component-specific prompts and a two-stage training strategy to learn robust cross-component patterns. The approach achieves state-of-the-art generalization on Ego4D, EPIC-Kitchens, and EGTEA in within-dataset, cross-dataset, and base-to-novel settings, and its ablations highlight the importance of prompt diversity and interaction. The work advances prompt-learning-based generalization in first-person HOI understanding and offers a scalable framework for EAR under real-world distribution shifts.

Abstract

Driven by the increasing demand for applications in augmented and virtual reality, egocentric action recognition has emerged as a prominent research area. It is typically divided into two subtasks: recognizing the performed behavior (i.e., the verb component) and identifying the objects being acted upon (i.e., the noun component) from the first-person perspective. However, most existing approaches treat these two components as independent classification tasks, focusing on extracting component-specific knowledge while overlooking their inherent semantic and contextual relationships, which leads to fragmented representations and sub-optimal generalization capability. To address these challenges, we propose EgoPrompt, a prompt learning-based framework for egocentric action recognition. Building on the existing prompting strategy to capture component-specific knowledge, we construct a Unified Prompt Pool space to establish interaction between the two types of component representations. Specifically, the component representations (from verbs and nouns) are first decomposed into fine-grained patterns in the form of prompt pairs. Then, these pattern-level representations are fused through an attention-based mechanism to facilitate cross-component interaction. To ensure the prompt pool is informative, we further introduce a novel training objective, Diverse Pool Criteria. This objective pursues that goal from two perspectives: Prompt Selection Frequency Regularization and Prompt Knowledge Orthogonalization. Extensive experiments are conducted on the Ego4D, EPIC-Kitchens, and EGTEA datasets. The results consistently show that EgoPrompt achieves state-of-the-art performance across within-dataset, cross-dataset, and base-to-novel generalization benchmarks.
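
The two perspectives of the Diverse Pool Criteria can be illustrated with a small sketch. The abstract does not give the loss formulas, so everything below is an assumption made for illustration: the frequency term is written as a KL divergence from the uniform distribution, the orthogonalization term as a mean squared cosine similarity, and the function names `selection_frequency_penalty` and `orthogonality_penalty` are hypothetical. It is not the authors' implementation.

```python
import math


def selection_frequency_penalty(counts):
    """KL divergence between the empirical selection distribution and uniform.

    Assumption: counts[i] is how often prompt i was selected in a batch,
    and at least one count is positive. The value is 0 when every prompt
    is selected equally often and grows as usage becomes more skewed.
    """
    total = sum(counts)
    n = len(counts)
    penalty = 0.0
    for c in counts:
        p = c / total
        if p > 0:  # terms with p == 0 contribute nothing to the KL sum
            penalty += p * math.log(p * n)
    return penalty


def orthogonality_penalty(prompts):
    """Mean squared cosine similarity over all distinct pairs of prompts.

    Assumption: each prompt is a non-zero vector and there are at least two.
    The value is 0 when all prompts are mutually orthogonal and 1 when all
    prompts point in the same direction.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    unit = [[x / norm(v) for x in v] for v in prompts]
    total, pairs = 0.0, 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += cos * cos
            pairs += 1
    return total / pairs
```

Under these assumptions, a pool whose prompts are all selected equally often has a frequency penalty of zero, and a pool of mutually orthogonal prompts has an orthogonality penalty of zero. A training objective in this spirit would add both terms, each with its own weight, to the classification loss.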

Paper Structure

This paper contains 18 sections, 8 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison with existing frameworks. (a) Conventional methods use the verb or noun component label to fine-tune the corresponding encoder independently. (b) Our proposed EgoPrompt constructs a Unified Prompt Pool under the Diverse Pool Criteria constraint, which decomposes component-specific representations into implicit prompt pair patterns and achieves better knowledge interaction through an attention-based fusion mechanism.
  • Figure 2: Overall framework of EgoPrompt. Building on the baseline, EgoPrompt further establishes semantic interaction between components. Specifically, under the guidance of the Diverse Pool Criteria, EgoPrompt constructs the Unified Pattern Space with the Unified Prompt Pool design. It decomposes each component-specific representation into fine-grained patterns and integrates the selected prompt pairs into a fused representation.
  • Figure 3: Comparison of generalization performance. Each sub-caption denotes the training and testing datasets in the cross-dataset generalization setting.
  • Figure 4: Adaptability of EgoPrompt to different backbones. These results are from the cross-dataset generalization setting "From E4D to EK" (Ego4D to EPIC-Kitchens).
  • Figure 5: Qualitative examples of EgoPrompt’s improvements. (a) Noun correction: The baseline (X-MIC) incorrectly grounds the object to "bag," while EgoPrompt leverages verb-centric semantics (e.g., deformable nature of "pour out") to infer the correct noun "flour." (b) Verb correction: The baseline misinterprets the action as "hold sleeve." EgoPrompt captures the state change of the noun (i.e., sleeve being cut), which contradicts the static nature of "hold" and corrects the verb to "cut." Green and red highlights indicate correct and incorrect predictions, respectively.
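
The pipeline described in the Figure 2 caption (decompose each component representation into patterns from a shared pool, then fuse the selected prompt pairs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' method: it assumes each pool entry is a key-value pair, that selection is top-k by cosine similarity between a component feature and the keys, and that fusion is single-query scaled dot-product attention. All function names and the averaged query are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Assumes both vectors are non-zero.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def select_top_k(query, keys, k):
    """Indices of the k pool keys most similar to the query feature."""
    scores = [cosine(query, key) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def attention_fuse(query, values):
    """Single-query scaled dot-product attention over the selected values."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in values])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def fuse_components(verb_feat, noun_feat, pool_keys, pool_values, k=2):
    """Select prompts for verb and noun from one shared pool, then fuse them."""
    idx = select_top_k(verb_feat, pool_keys, k) + select_top_k(noun_feat, pool_keys, k)
    selected = [pool_values[i] for i in idx]
    # Assumption: the fusion query is the mean of the two component features.
    query = [(v + n) / 2 for v, n in zip(verb_feat, noun_feat)]
    return idx, attention_fuse(query, selected)
```

Because both components draw from the same pool, a prompt that is useful for the verb can also be selected for the noun, which is one plausible way to read the "Unified" in Unified Prompt Pool. A real model would use learned, high-dimensional prompts and batched tensor operations rather than Python lists.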