Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning

Sanjoy Kundu; Shubham Trehan; Sathyanarayanan N. Aakur

TL;DR

The paper tackles open-world egocentric activity recognition, where the target labels are unknown and traditional zero-shot approaches, which depend on a given candidate label set, are therefore limited. It introduces ALGO, a neuro-symbolic framework that grounds objects via evidence-based prompting using CLIP and ConceptNet, then discovers plausible actions with an energy-based pattern-theory formulation and grounds actions visually through a learned mapping to ConceptNet embeddings. Experiments on EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus, and Charades-Ego evaluate open-world and zero-shot performance, with ALGO outperforming several baselines and generalizing to unseen vocabularies. By reducing reliance on large labeled datasets and using symbolic knowledge for reasoning, ALGO offers a scalable approach to grounded action understanding, with potential applications in autonomous perception and open-ended video understanding.

Abstract

Learning to infer labels in an open world, i.e., in an environment where the target "labels" are unknown, is an important characteristic for achieving autonomy. Foundation models, pre-trained on enormous amounts of data, have shown remarkable generalization skills through prompting, particularly in zero-shot inference. However, their performance is bounded by the correctness of the target label's search space, i.e., the candidate labels provided in the prompt. In an open world, this search space can be unknown or exceptionally large, which severely restricts their performance. To tackle this challenging problem, we propose a two-step, neuro-symbolic framework called ALGO (Action Learning with Grounded Object recognition) that uses symbolic knowledge stored in large-scale knowledge bases to infer activities in egocentric videos with limited supervision. First, we propose a neuro-symbolic prompting approach that uses object-centric vision-language models as a noisy oracle to ground objects in the video through evidence-based reasoning. Second, driven by prior commonsense knowledge, we discover plausible activities through an energy-based symbolic pattern theory framework and learn to ground knowledge-based action (verb) concepts in the video. Extensive experiments on four publicly available datasets (EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus, and Charades-Ego) demonstrate its performance on open-world activity inference. We also show that ALGO can be extended to zero-shot inference and demonstrate its competitive performance on the Charades-Ego dataset.
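To make the two-step idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes hypothetical noisy-oracle object scores, hypothetical visual verb scores, and a hypothetical commonsense prior over verb-object pairs, and it combines them with a simple additive energy chosen purely for illustration (the paper's actual energy function is not given in this summary). All names and numbers below are assumptions.

```python
from itertools import product

# Hypothetical noisy-oracle confidences for objects in the gaze region
# (stand-ins for scores a vision-language model might return).
object_scores = {"knife": 0.6, "fork": 0.3, "table": 0.1}

# Hypothetical visual evidence for each action (verb) concept.
verb_scores = {"cut": 0.5, "pour": 0.3, "mix": 0.2}

# Hypothetical commonsense prior: how related a verb and an object are,
# standing in for assertion weights from a knowledge base.
prior = {
    ("cut", "knife"): 0.9,
    ("cut", "fork"): 0.4,
    ("pour", "knife"): 0.1,
    ("mix", "fork"): 0.5,
}


def energy(verb, obj):
    """Illustrative additive energy: lower means more plausible.

    Pairs missing from the prior contribute no commonsense support.
    """
    support = object_scores[obj] + verb_scores[verb] + prior.get((verb, obj), 0.0)
    return -support


def rank_activities(top_k=3):
    """Return the top_k (verb, object) pairs ordered by increasing energy."""
    candidates = product(verb_scores, object_scores)
    return sorted(candidates, key=lambda pair: energy(*pair))[:top_k]


if __name__ == "__main__":
    for verb, obj in rank_activities():
        print(verb, obj, round(energy(verb, obj), 2))
```

In this sketch no candidate activity labels are supplied up front: the candidates are composed from grounded objects and verb concepts, and the prior decides which combinations are plausible.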

Paper Structure (12 sections, 3 equations, 4 figures, 4 tables)

Introduction
Related Works
Proposed Framework: ALGO
Evidence-based Object Grounding with Prior-driven Prompting
Object-driven Activity Discovery
Visual-Semantic Action Grounding
Experimental Evaluation
Open World Activity Recognition
Extension to Zero-Shot Egocentric Activity Recognition
Ablation Studies
Generalization of Learned Actions to Unknown Vocabulary
Discussion, Limitations, and Future Work

Figures (4)

Figure 1: Overall architecture of the proposed approach (ALGO). Using a two-step process, we first ground the objects within a gaze-driven ROI using CLIP (Radford et al., 2021) as a noisy oracle before reasoning over the plausible activities performed in the video. The inferred activity and action (verb) are grounded in prior knowledge and visual features to refine the activity interpretations.
Figure 2: Ablation studies showing the impact of (a) the quality of object grounding techniques, (b) posterior-based action refinement, (c) iterative action refinement on generalization capabilities, and (d) the choice of visual and semantic representations.
Figure 3: Visualization of alternative concepts that were tested for grounding concepts in the video such as (a) fork, (b) knife, (c) table, (d) pepperoni, (e) biscuit, and (f) chocolate. These are automatically derived from ConceptNet and have semantic assertions quantifying how they are related.
Figure 4: Visualization of final interpretations for videos containing the activity (a) cut fork (top interpretation), (b) cut fork (second best interpretation), (c) pour honey, (d) pour ketchup, (e) mix ketchup, and (f) mix bowl. These are automatically derived from ConceptNet and have semantic assertions quantifying how they are related.
