Object Aware Egocentric Online Action Detection

Joungbin An, Yunsu Park, Hyolim Kang, Seon Joo Kim

TL;DR

This work tackles Online Action Detection (OAD) in egocentric video, where models designed for exocentric footage overlook first-person, object-centric cues. It introduces an Object-Aware Module that extracts object presence with Faster R-CNN, then uses two transformers with learnable queries to fuse object information with temporal cues before classifying verbs, nouns, and actions. On EPIC-Kitchens-100, adding the module to multiple strong OAD baselines yields consistent improvements, particularly in verb accuracy, while remaining lightweight and easy to integrate. By grounding actions in object interactions, the approach advances real-time egocentric video understanding, with potential applications in AR and assistive technologies.
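
To make the "object presence" input mentioned above concrete, the following is a minimal sketch of turning one frame's detector outputs into a binary presence vector. It is an illustration under assumptions, not the authors' implementation: the `Detection` container, the `presence_vector` function, the 0.5 score threshold, and the five-class toy vocabulary are all hypothetical, and the detector itself (Faster R-CNN in the paper) is not run here.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """One detector output for a frame (hypothetical container)."""
    class_id: int   # index into the object vocabulary
    score: float    # detector confidence in [0, 1]


def presence_vector(detections: List[Detection],
                    num_classes: int,
                    min_score: float = 0.5) -> List[int]:
    """Return a multi-hot vector: 1 if a class was detected confidently, else 0.

    The threshold is an assumed value chosen for illustration only.
    """
    present = [0] * num_classes
    for det in detections:
        if not 0 <= det.class_id < num_classes:
            raise ValueError(f"class_id {det.class_id} outside vocabulary")
        if det.score >= min_score:
            present[det.class_id] = 1
    return present


# Toy frame: classes 1 and 3 pass the threshold, class 2 does not.
frame = [Detection(1, 0.91), Detection(2, 0.30),
         Detection(3, 0.75), Detection(1, 0.60)]
vec = presence_vector(frame, num_classes=5)
```

A presence vector of this kind discards box positions and counts, keeping only which object classes are visible; whether the paper's module uses exactly this encoding is not stated in the summary.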

Abstract

Advancements in egocentric video datasets like Ego4D, EPIC-Kitchens, and Ego-Exo4D have enriched the study of first-person human interactions, which is crucial for applications in augmented reality and assisted living. Despite these advancements, current Online Action Detection (OAD) methods, which efficiently detect actions in streaming videos, are predominantly designed for exocentric views and thus fail to capitalize on the unique perspectives inherent to egocentric videos. To address this gap, we introduce an Object-Aware Module that integrates egocentric-specific priors into existing OAD frameworks, enhancing first-person footage interpretation. Utilizing object-specific details and temporal dynamics, our module improves scene understanding when detecting actions. Validated extensively on the EPIC-Kitchens-100 dataset, our work can be seamlessly integrated into existing models with minimal overhead and brings consistent performance enhancements, marking an important step forward in adapting action detection systems to egocentric video analysis.
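
The streaming setting described above means that each prediction may depend only on frames seen so far. The sketch below shows that constraint with a fixed-size memory of recent frame features. It is a generic illustration, not the paper's model: `stream_predictions`, the `memory_size` value, and the averaging `predict` function are hypothetical stand-ins for a real OAD backbone.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List


def stream_predictions(frames: Iterable[float],
                       predict: Callable[[List[float]], float],
                       memory_size: int = 4) -> Iterator[float]:
    """Yield one prediction per incoming frame using only past and current frames.

    `frames` stands in for per-frame features (scalars here, for simplicity).
    The deque drops the oldest feature once `memory_size` is reached.
    """
    memory: Deque[float] = deque(maxlen=memory_size)
    for feature in frames:
        memory.append(feature)        # the current frame joins the memory
        yield predict(list(memory))   # no future frames are ever visible


def mean_score(window: List[float]) -> float:
    """Toy predictor (an assumption): average the features in the window."""
    return sum(window) / len(window)


outputs = list(stream_predictions([1.0, 3.0, 5.0], mean_score, memory_size=2))
```

With a memory of two frames, the third output is computed from the second and third features only, which mirrors how an online detector trades long-range context for bounded per-frame cost.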

Paper Structure

This paper contains 14 sections, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Overview of our methodology.
  • Figure 2: Object-Aware Module.