GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition

Guangzhao Dai, Xiangbo Shu, Wenhao Wu, Rui Yan, Jiachao Zhang

TL;DR

This work targets Zero-Shot Egocentric Action Recognition (ZS-EAR) by addressing the misalignment between vision and language in traditional VLM-based approaches. It introduces GPT4Ego, a simple yet powerful framework that promotes fine-grained concept-description alignment through two modules: Ego-oriented Text Prompting (EgoTP) and Ego-oriented Visual Parsing (EgoVP). EgoTP expands class names into sentence-level contextual descriptions via chain-of-thought prompts using ChatGPT, while EgoVP uses SAM to parse frames into refined vision-contextual concepts. The approach yields state-of-the-art zero-shot performance on EPIC-KITCHENS-100 (EK100), EGTEA, and CharadesEgo, with substantial gains over prior VLM methods, demonstrating the practical value of integrating LLMs and vision foundation models for egocentric video understanding.

Abstract

Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for notable performance in Zero-Shot Egocentric Action Recognition (ZS-EAR). Typically, VLMs handle ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of vision and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric videos. In this paper, we introduce GPT4Ego, a straightforward yet remarkably potent VLM framework for ZS-EAR, designed to enhance the fine-grained alignment of concept and description between vision and language. Extensive experiments demonstrate GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks, i.e., EPIC-KITCHENS-100 (33.2%, +9.4%), EGTEA (39.6%, +5.5%), and CharadesEgo (31.5%, +2.6%).
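
To make the contrast between global video-text matching and fine-grained concept-description alignment concrete, the following is a minimal sketch, not the paper's formulation (its equations are not reproduced on this page). It assumes that embeddings for the video, the parsed visual concepts, the class names, and the generated class descriptions have already been produced by some encoder. The function names, the mixing weight `alpha`, and the max-then-mean aggregation are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_scores(video_emb, concept_embs, class_name_embs,
                     class_desc_embs, alpha=0.5):
    """Combine a global score with a fine-grained alignment score.

    Shapes (all hypothetical, for illustration only):
      video_emb:       (D,)      one embedding for the whole clip
      concept_embs:    (K, D)    embeddings of K parsed visual concepts
      class_name_embs: (C, D)    embeddings of the C class names
      class_desc_embs: (C, M, D) M generated descriptions per class
    `alpha` is an assumed mixing weight, not a value from the paper.
    """
    v = l2_normalize(video_emb)
    k = l2_normalize(concept_embs)
    n = l2_normalize(class_name_embs)
    d = l2_normalize(class_desc_embs)

    # Global matching: cosine similarity of the clip to each class name.
    global_sim = n @ v                              # (C,)

    # Fine-grained matching: cosine similarity between every description
    # and every visual concept, giving a (C, M, K) tensor.
    pairwise = np.einsum("cmd,kd->cmk", d, k)

    # Assumed aggregation: best-matching concept per description,
    # then the mean over a class's descriptions.
    fine_sim = pairwise.max(axis=2).mean(axis=1)    # (C,)

    return alpha * global_sim + (1.0 - alpha) * fine_sim


def predict(video_emb, concept_embs, class_name_embs, class_desc_embs,
            alpha=0.5):
    """Return the index of the highest-scoring class."""
    scores = zero_shot_scores(video_emb, concept_embs, class_name_embs,
                              class_desc_embs, alpha)
    return int(np.argmax(scores))
```

With `alpha=1.0` the sketch reduces to plain global video-text matching, so the fine-grained term is the only part that can separate two classes whose names are equally similar to the clip.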

Paper Structure

This paper contains 17 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: GPT4Ego achieves significant performance gains over state-of-the-art methods on the Zero-Shot Egocentric Action Recognition (ZS-EAR) task by prompting more visual concepts and textual descriptions as contextual semantics.
  • Figure 2: Illustration of pre-trained vision-language models (VLMs) for zero-shot egocentric action recognition (ZS-EAR). (a) Previous VLMs treat ZS-EAR as a coarse-grained global video-text matching task, resulting in poor semantic alignment. (b) GPT4Ego addresses this limitation in two ways: prompting more action-related textual descriptions and prompting more action-related visual concepts, using open-resource large language and vision foundation models (e.g., ChatGPT and SAM). (c) The resulting GPT4Ego paradigm integrates SAM and GPT into VLMs to promote fine-grained semantic alignment between vision and language for ZS-EAR.
  • Figure 3: Visualization of textual descriptions and visual concepts generated by GPT4Ego. For a better view, the fine-grained description-concept alignment in vision and language is explicitly highlighted with multi-color lines. For comparison, we report the predictions of the top-5 class labels obtained by GPT4Ego and the general method.