GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition
Guangzhao Dai, Xiangbo Shu, Wenhao Wu, Rui Yan, Jiachao Zhang
TL;DR
This work targets Zero-Shot Egocentric Action Recognition (ZS-EAR), where conventional VLM-based approaches treat recognition as global video-text matching and therefore align vision and language poorly. It introduces GPT4Ego, a simple yet powerful framework that promotes fine-grained concept-description alignment through two modules: Ego-oriented Text Prompting (EgoTP) and Ego-oriented Visual Parsing (EgoVP). EgoTP uses ChatGPT with chain-of-thought prompts to expand class names into sentence-level contextual descriptions, while EgoVP uses SAM to parse frames into refined vision-contextual concepts. The approach achieves state-of-the-art zero-shot performance on EPIC-KITCHENS-100 (EK100), EGTEA, and CharadesEgo, with gains over prior VLM methods of +9.4%, +5.5%, and +2.6%, respectively, demonstrating the practical value of integrating LLMs and vision foundation models for egocentric video understanding.
Abstract
Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for notable performance in Zero-Shot Egocentric Action Recognition (ZS-EAR). Typically, VLMs handle ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of vision and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric videos. In this paper, we introduce GPT4Ego, a straightforward yet remarkably potent VLM framework for ZS-EAR, designed to enhance the fine-grained alignment of concept and description between vision and language. Extensive experiments demonstrate that GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks, i.e., EPIC-KITCHENS-100 (33.2%, +9.4%), EGTEA (39.6%, +5.5%), and CharadesEgo (31.5%, +2.6%).
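To make the contrast between global video-text matching and fine-grained concept-description alignment concrete, the following is a minimal sketch rather than the paper's implementation. It assumes embeddings have already been produced by some VLM encoder (not shown). The function names `global_scores` and `fine_grained_scores` are hypothetical, and the aggregation used here (best-matching description per concept, then averaged over concepts) is an illustrative assumption, not the scoring rule of GPT4Ego.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def global_scores(video_emb, class_embs):
    """Global matching: one video embedding (D,) against one text
    embedding per class (C, D). Returns cosine similarities of shape (C,)."""
    return l2_normalize(class_embs) @ l2_normalize(video_emb)


def fine_grained_scores(concept_embs, description_embs):
    """Hypothetical fine-grained matching (an assumption for illustration).

    concept_embs:     (R, D) embeddings of R parsed visual concepts.
    description_embs: (C, S, D) embeddings of S descriptions for each of C classes.

    Each concept is paired with its most similar description within a class,
    and the per-concept maxima are averaged into one score per class.
    """
    c = l2_normalize(concept_embs)
    d = l2_normalize(description_embs)
    sim = np.einsum("rd,csd->crs", c, d)  # (C, R, S) cosine similarities
    return sim.max(axis=2).mean(axis=1)   # (C,)


# Toy stand-in embeddings (D = 3); real ones would come from a VLM encoder.
video = np.array([1.0, 0.0, 0.0])
class_names = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
concepts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
descriptions = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # class 0 descriptions
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # class 1 descriptions
])

pred_global = int(np.argmax(global_scores(video, class_names)))
pred_fine = int(np.argmax(fine_grained_scores(concepts, descriptions)))
```

In both cases the zero-shot prediction is the class with the highest score; the difference lies only in whether similarity is computed once per video and class name, or aggregated over several concept-description pairs.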
