Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning

Sanjoy Kundu, Shubham Trehan, Sathyanarayanan N. Aakur

TL;DR

The paper tackles open-world egocentric activity recognition, a setting in which the target labels are unknown and traditional zero-shot approaches, which rely on a given set of candidate labels, are limited. It introduces ALGO, a neuro-symbolic framework that grounds objects via evidence-based prompting using CLIP and ConceptNet, then discovers plausible actions with an energy-based pattern-theory formulation and grounds actions visually through a learned mapping to ConceptNet embeddings. Extensive experiments on EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus, and Charades-Ego demonstrate strong open-world and zero-shot performance, outperforming several baselines and showing meaningful generalization to unseen vocabularies. By reducing reliance on large labeled datasets and leveraging symbolic knowledge for reasoning, ALGO offers a scalable approach to grounded action understanding with potential applications in autonomous perception and open-ended video understanding.

Abstract

Learning to infer labels in an open world, i.e., in an environment where the target "labels" are unknown, is an important characteristic for achieving autonomy. Foundation models, pre-trained on enormous amounts of data, have shown remarkable generalization skills through prompting, particularly in zero-shot inference. However, their performance is limited by the correctness of the target label's search space, i.e., the candidate labels provided in the prompt. This target search space can be unknown or exceptionally large in an open world, severely restricting their performance. To tackle this challenging problem, we propose a two-step, neuro-symbolic framework called ALGO (Action Learning with Grounded Object recognition) that uses symbolic knowledge stored in large-scale knowledge bases to infer activities in egocentric videos with limited supervision. First, we propose a neuro-symbolic prompting approach that uses object-centric vision-language models as a noisy oracle to ground objects in the video through evidence-based reasoning. Second, driven by prior commonsense knowledge, we discover plausible activities through an energy-based symbolic pattern theory framework and learn to ground knowledge-based action (verb) concepts in the video. Extensive experiments on four publicly available datasets (EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus, and Charades-Ego) demonstrate its performance on open-world activity inference. We also show that ALGO can be extended to zero-shot inference and demonstrate its competitive performance on the Charades-Ego dataset.
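
The two steps described in the abstract can be illustrated with a toy sketch. This is not the paper's formulation: the scores, the function names (`ground_objects`, `interpretation_energy`, `best_interpretation`), the multiplicative re-ranking rule, and the additive energy are all assumptions chosen for illustration. In practice the oracle scores would come from a vision-language model such as CLIP, and the relatedness and affinity values from a knowledge base such as ConceptNet.

```python
# Illustrative sketch only. Every number, name, and combination rule here is a
# hypothetical stand-in, not a value or formula taken from the paper.

def ground_objects(oracle_scores, kb_relatedness, evidence, top_k=2):
    """Step 1 (assumed form): re-rank noisy oracle scores with knowledge-base evidence."""
    ranked = []
    for obj, score in oracle_scores.items():
        # Mean relatedness between this candidate and the evidence concepts.
        support = sum(kb_relatedness.get((obj, e), 0.0) for e in evidence)
        support /= max(len(evidence), 1)
        ranked.append((score * (1.0 + support), obj))
    ranked.sort(reverse=True)
    return [obj for _, obj in ranked[:top_k]]


def interpretation_energy(verb, obj, verb_scores, object_scores, affinity):
    """Step 2 (assumed form): lower energy means a more plausible (verb, object) pair."""
    return -(
        verb_scores.get(verb, 0.0)
        + object_scores.get(obj, 0.0)
        + affinity.get((verb, obj), 0.0)
    )


def best_interpretation(verbs, objects, verb_scores, object_scores, affinity):
    """Return the (energy, verb, object) triple with the lowest energy."""
    candidates = [
        (interpretation_energy(v, o, verb_scores, object_scores, affinity), v, o)
        for v in verbs
        for o in objects
    ]
    return min(candidates)


# Hypothetical oracle scores for objects in the region of interest.
oracle_scores = {"knife": 0.40, "fork": 0.45, "honey": 0.15}
# Hypothetical knowledge-base relatedness to a concept already seen in the scene.
kb_relatedness = {("knife", "bread"): 0.8, ("fork", "bread"): 0.2, ("honey", "bread"): 0.5}
# Hypothetical verb scores and verb-object affinities.
verb_scores = {"cut": 0.6, "pour": 0.4}
affinity = {("cut", "knife"): 0.9, ("cut", "fork"): 0.2, ("pour", "knife"): 0.1}

grounded = ground_objects(oracle_scores, kb_relatedness, evidence=["bread"])
best = best_interpretation(["cut", "pour"], grounded, verb_scores, oracle_scores, affinity)
print(grounded, best[1], best[2])
```

In this toy setup the knowledge-base evidence promotes "knife" above "fork" even though the oracle alone slightly preferred "fork", which is the kind of correction evidence-based grounding is meant to provide when the oracle is noisy.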

Paper Structure (12 sections, 3 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Overall architecture of the proposed approach (ALGO) is illustrated here. Using a two-step process, we first ground the objects within a gaze-driven ROI using CLIP (Radford et al., 2021) as a noisy oracle before reasoning over the plausible activities performed in the video. The inferred activity and action (verb) are grounded in prior knowledge and visual features to refine the activity interpretations.
  • Figure 2: Ablation studies showing the impact of (a) the quality of object grounding techniques, (b) posterior-based action refinement, (c) iterative action refinement on generalization capabilities, and (d) the choice of visual and semantic representations.
  • Figure 3: Visualization of alternative concepts that were tested for grounding concepts in the video such as (a) fork, (b) knife, (c) table, (d) pepperoni, (e) biscuit, and (f) chocolate. These are automatically derived from ConceptNet and have semantic assertions quantifying how they are related.
  • Figure 4: Visualization of final interpretations for videos containing the activity (a) cut fork (top interpretation), (b) cut fork (second best interpretation), (c) pour honey, (d) pour ketchup, (e) mix ketchup, and (f) mix bowl. These are automatically derived from ConceptNet and have semantic assertions quantifying how they are related.