AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

Qi Zhao, Shijie Wang, Ce Zhang, Changcheng Fu, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun

TL;DR

AntGPT investigates whether large language models can aid long-term action anticipation from video by leveraging their procedural priors for both goal inference and planning. The framework links a video action recognizer to LLMs in a two-stage pipeline, enabling in-context goal inference and either bottom-up or top-down action prediction, with further exploration of temporal modeling and distillation. Empirically, AntGPT achieves state-of-the-art results on Ego4D LTA benchmarks and EK-55/EGTEA, with notable gains on rare actions and an efficient 91M-parameter distilled model. The work highlights the utility of language priors for video understanding while acknowledging limitations in fixed-length action representations and prompt design, pointing to future work on multiple plausible goals and richer video representations.

Abstract

Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after his/her current action (e.g. crack eggs)? What if we also know the longer-term goal of the actor (e.g. making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal. We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives. They can provide prior knowledge about the possible next actions, and infer the goal given the observed part of a procedure, respectively. To leverage the LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure by chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, as well as EGTEA GAZE+ demonstrate the effectiveness of our proposed approach. AntGPT achieves state-of-the-art performance on all of the above benchmarks, and qualitative analysis shows that it can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction. Code and model will be released at https://brown-palm.github.io/AntGPT
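The second stage described above (turning recognized actions into an LLM query, either bottom-up or goal-conditioned top-down) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the helper names `format_actions`, `build_prompt`, and `parse_actions`, and the comma-separated output format are all assumptions made for this example. The action recognizer and the LLM call itself are left out.

```python
from typing import List, Optional, Tuple

# An action is a (verb, noun) pair, matching the LTA output format.
Action = Tuple[str, str]


def format_actions(actions: List[Action]) -> str:
    """Render (verb, noun) pairs as a comma-separated string."""
    return ", ".join(f"{verb} {noun}" for verb, noun in actions)


def build_prompt(observed: List[Action], num_future: int,
                 goal: Optional[str] = None) -> str:
    """Build a text prompt from recognized actions (hypothetical wording).

    With goal=None the prompt is bottom-up: the LLM continues the sequence.
    With a goal string the prompt is top-down: prediction is goal-conditioned.
    """
    lines = []
    if goal is not None:
        lines.append(f"Goal: {goal}")
    lines.append(f"Observed actions: {format_actions(observed)}")
    lines.append(f"Next {num_future} actions:")
    return "\n".join(lines)


def parse_actions(completion: str) -> List[Action]:
    """Parse an LLM completion back into (verb, noun) pairs.

    Chunks with fewer than two words are dropped as malformed.
    """
    actions = []
    for chunk in completion.split(","):
        words = chunk.strip().split()
        if len(words) >= 2:
            actions.append((words[0], " ".join(words[1:])))
    return actions
```

Calling `build_prompt(observed, 3)` yields a bottom-up prompt, while `build_prompt(observed, 3, goal="making egg fried rice")` prepends the goal line. Keeping both variants in one function makes it easy to compare the two paradigms on identical recognized actions.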

Paper Structure (30 sections, 7 figures, 14 tables)

This paper contains 30 sections, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Illustration of AntGPT. (a) Overview of LTA paradigms. The bottom-up approach predicts future actions directly based on observed human activities, while the top-down approach is guided by high-level goals inferred from observations (and hence allows procedure planning). (b) Actions as video representations. A pre-trained action recognition model $\mathcal{E}$ takes visual observations $V_o$ as inputs and generates action labels, which can be noisy (shown in red). (c) Goal inferred by an LLM. We provide a few human-provided examples of action sequences and the expected high-level goals, and leverage an LLM $\mathcal{\pi}$ to infer the goal via in-context learning. (d) Knowledge Distillation. We distill a frozen LLM $\mathcal{\pi}_t$ into a compact student model $\mathcal{\pi}_s$ at sequence level. (e) Few-shot LTA by in-context learning (ICL), where the ICL prompts can be either bottom-up or top-down.
  • Figure 2: Examples of the goals inferred by LLMs. Goals are inferred from the recognized actions of the 8 observed segments. The future actions are ground truth for illustration purposes.
  • Figure A1: Illustration of goal prediction and LTA with LLMs: (a) High-level goal prediction with in-context learning (ICL). (b) Few-shot bottom-up action prediction with ICL. (c) Top-down prediction with chain-of-thought (CoT). Green words indicate correctly recognized actions (inputs to the LLM) and future predictions (outputs of the LLM); red words indicate incorrectly recognized or predicted actions. For this example, the ground-truth observations are [put paintbrush, adjust paintbrush, take container, dip container, paint wall, paint wall, dip wall, paint wall].
  • Figure A2: Illustrations of counterfactual prediction. We replace the originally inferred goal (gardening or trimming plants and fix machine) with an altered goal (harvesting crops and examine machine), and observe that the anticipated actions change accordingly, even with the same set of recognized actions as the inputs to the LLM. Words marked in red highlight the different predictions.
  • Figure A3: Four examples of results from fine-tuned AntGPT. Green words indicate correctly recognized actions (inputs to the LLM) and future predictions (outputs of the LLM); red words indicate incorrectly recognized or predicted actions.
  • ...and 2 more figures
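
The in-context goal inference described in the captions of Figure 1(c) and Figure A1(a) can be pictured as a few-shot prompt built from (action sequence, goal) demonstrations followed by the query sequence. The sketch below is only an illustration under assumptions: the `Actions:` / `Goal:` template and the function name `build_goal_prompt` are hypothetical, and the single demonstration reuses the egg example from the abstract rather than any prompt from the paper.

```python
from typing import List, Tuple

# A demonstration pairs an action sequence with a human-written goal.
Example = Tuple[List[str], str]


def build_goal_prompt(examples: List[Example], query_actions: List[str]) -> str:
    """Assemble a few-shot prompt for goal inference (hypothetical template).

    Each demonstration is rendered as an "Actions: ... / Goal: ..." block;
    the query block ends with an open "Goal:" for the LLM to complete.
    """
    blocks = [
        f"Actions: {', '.join(actions)}\nGoal: {goal}"
        for actions, goal in examples
    ]
    blocks.append(f"Actions: {', '.join(query_actions)}\nGoal:")
    return "\n\n".join(blocks)


# Illustrative demonstration taken from the abstract's running example.
demos = [(["crack eggs", "mix eggs"], "making egg fried rice")]

# Ground-truth observations quoted in the Figure A1 caption.
query = ["put paintbrush", "adjust paintbrush", "take container",
         "dip container", "paint wall", "paint wall", "dip wall", "paint wall"]

prompt = build_goal_prompt(demos, query)
```

The resulting `prompt` string would be sent to an LLM, whose completion after the final `Goal:` is taken as the inferred goal. In practice the action list would come from the recognition model and may contain errors.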