Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning
Sanjoy Kundu, Shubham Trehan, Sathyanarayanan N. Aakur
TL;DR
The paper tackles open-world egocentric activity recognition where target labels are unknown, limiting traditional zero-shot approaches. It introduces ALGO, a neuro-symbolic framework that grounds objects via evidence-based prompting using CLIP and ConceptNet, then discovers plausible actions with an energy-based pattern-theory formulation and grounds actions visually through a learned mapping to ConceptNet embeddings. Extensive experiments on EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus, and Charades-Ego demonstrate strong open-world and zero-shot performance, outperforming several baselines and showing meaningful generalization to unseen vocabularies. By reducing reliance on large labeled datasets and leveraging symbolic knowledge for reasoning, ALGO offers a scalable approach to grounded action understanding with potential applications in autonomous perception and open-ended video understanding.
Abstract
Learning to infer labels in an open world, i.e., in an environment where the target "labels" are unknown, is an important characteristic for achieving autonomy. Foundation models, pre-trained on enormous amounts of data, have shown remarkable generalization skills through prompting, particularly in zero-shot inference. However, their performance is limited by the correctness of the target label's search space, i.e., the candidate labels provided in the prompt. In an open world, this search space can be unknown or exceptionally large, which severely restricts their performance. To tackle this challenging problem, we propose a two-step, neuro-symbolic framework called ALGO (Action Learning with Grounded Object recognition) that uses symbolic knowledge stored in large-scale knowledge bases to infer activities in egocentric videos with limited supervision. First, we propose a neuro-symbolic prompting approach that uses object-centric vision-language models as a noisy oracle to ground objects in the video through evidence-based reasoning. Second, driven by prior commonsense knowledge, we discover plausible activities through an energy-based symbolic pattern theory framework and learn to ground knowledge-based action (verb) concepts in the video. Extensive experiments on four publicly available datasets (EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus, and Charades-Ego) demonstrate its effectiveness on open-world activity inference. We also show that ALGO can be extended to zero-shot inference and demonstrate its competitive performance on the Charades-Ego dataset.
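The two-step idea in the abstract (ground objects first, then pick the activity that commonsense knowledge makes most plausible) can be illustrated with a toy sketch. This is not the paper's pattern-theory formulation: the object confidences, the verb-object affinities, and the functions `energy` and `rank_activities` are all hypothetical stand-ins, and the energy here is simply the negative product of visual evidence and prior affinity.

```python
from itertools import product

# Hypothetical object confidences, standing in for the output of a
# vision-language model used as a noisy oracle (values are made up).
object_scores = {"knife": 0.7, "tomato": 0.6, "plate": 0.2}

# Hypothetical verb-object affinities, standing in for edge weights
# read from a commonsense knowledge base (values are made up).
affinity = {
    ("cut", "tomato"): 0.9,
    ("cut", "knife"): 0.4,
    ("wash", "plate"): 0.8,
    ("wash", "tomato"): 0.5,
}


def energy(verb, obj):
    """Toy energy: lower means more plausible.

    Combines visual evidence for the object with the prior affinity
    between verb and object. Unknown pairs get zero affinity.
    """
    return -(object_scores.get(obj, 0.0) * affinity.get((verb, obj), 0.0))


def rank_activities(verbs, objects):
    """Return (energy, verb, object) tuples sorted from most to least plausible."""
    candidates = [(energy(v, o), v, o) for v, o in product(verbs, objects)]
    return sorted(candidates)


ranked = rank_activities(["cut", "wash"], list(object_scores))
best_energy, best_verb, best_obj = ranked[0]
print(best_verb, best_obj)
```

With these made-up numbers the lowest-energy pair is ("cut", "tomato"), even though "knife" has the highest object confidence on its own, which shows how a commonsense prior can reweight noisy visual evidence.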
