Human Action Anticipation: A Survey

Bolin Lai, Sam Toyer, Tushar Nagarajan, Rohit Girdhar, Shengxin Zha, James M. Rehg, Kris Kitani, Kristen Grauman, Ruta Desai, Miao Liu

TL;DR

This survey comprehensively organizes the action-anticipation literature into seven fine-grained tasks, detailing their input/output specifications, evaluation metrics, model biases, and data modalities. It synthesizes a broad spectrum of approaches—from classic probabilistic and RNN-based methods to transformer-based and multimodal architectures—while highlighting pretraining strategies and auxiliary objectives that improve forecast quality. The authors provide a thorough cross-dataset quantitative panorama, comparing methods on eleven benchmarks and outlining key gaps, such as error accumulation, long-horizon modeling, and the potential of foundation-model–driven approaches. The work also maps a path forward for egocentric and exocentric forecasting, advocating for richer multimodal fusion, language-integrated perception, and more nuanced evaluation standards to drive progress in real-world forecasting systems.

Abstract

Predicting future human behavior is an increasingly popular topic in computer vision, driven by interest in applications such as autonomous vehicles, digital assistants, and human-robot interaction. The literature on behavior prediction spans various tasks, including action anticipation, activity forecasting, intent prediction, and goal prediction. Our survey aims to tie together this fragmented literature, covering recent technical innovations as well as the development of new large-scale datasets for model training and evaluation. We also summarize the widely used metrics for different tasks and provide a comprehensive performance comparison of existing approaches on eleven action anticipation datasets. This survey serves not only as a reference for contemporary methodologies in action anticipation, but also as a guideline for future research directions in this evolving field.

Paper Structure

This paper contains 59 sections, 9 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: A chronological overview of existing work in action anticipation. We only cover the approaches that made important technical breakthroughs or had a high impact in this field. Please refer to Table \ref{tab:action-methods} and Table \ref{tab:action-methods-2} for a thorough review.
  • Figure 2: Taxonomy of action anticipation tasks. Each of the tasks covered in this paper involves predicting a different portion of the future action segment information (action labels, time or descriptions) for each annotated video. We break down the action anticipation problem into seven fine-grained tasks and show a visual depiction of the input/output specification for each task. Corresponding notations are explained in Table \ref{tab:notation}, and each task is covered in detail in Section \ref{sec:action-tasks}.
  • Figure 3: An illustration of how the movements that precede a high-level action (a high five in this case) can be decomposed into a hierarchy of coarse, mid-level and fine-grained movemes. In this figure, the $x$-axis represents time, and the $y$-axis represents moveme granularity. The frame sequences in each colored box serve as examples of the corresponding moveme. Reproduced from Figure 2 of Lan et al. \cite{lan2014hierarchical}.
  • Figure 4: Hyperbolic geometry can capture uncertainty over future video sequences \cite{suris2021learning}. At left is a video sequence split into observed and unobserved segments. At right is a sketch of a hyperbolic space, represented by a circle. In this example, the model can infer that the future frame embedding will be one of the black squares, but is uncertain about which one. Averaging these possible embeddings produces a point $\hat{z}$ near the origin of the Poincaré ball. After observing more frames, the model will gain enough confidence to update its representation to the more specific value $z$ near the edge of the ball. Adapted from Figure 2 of Suris et al. \cite{suris2021learning}.
  • Figure 5: Vondrick et al. \cite{vondrick2016anticipating} propose future feature forecasting as an intermediate task in short-term action anticipation. Given a representation of the most recently observed frame $o_t$, their model predicts the representation $\phi(o_{t+1})$ of the target frame, then predicts an action from that representation. The advantage of this approach is that the frame forecasting model can be trained on a large amount of unlabeled data. Adapted from Figure 1 of Vondrick et al. \cite{vondrick2016anticipating}.
  • ...and 4 more figures
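
The geometric intuition in the Figure 4 caption can be checked with a small numerical sketch. This is not the method of Suris et al.; it only illustrates why an average of several specific candidates lands near the origin. The three candidate embeddings, their radius of 0.9, and the helper names `hyperbolic_norm` and `euclidean_mean` are all assumptions made for illustration, and a plain Euclidean mean stands in for whatever aggregation a trained model would learn. The only standard fact used is that, in the Poincaré ball with curvature -1, the distance from the origin to a point $x$ is $2\,\mathrm{artanh}(\lVert x \rVert)$.

```python
import math

def hyperbolic_norm(point):
    """Distance from the origin to a 2-D point in the Poincaré ball (curvature -1)."""
    r = math.hypot(point[0], point[1])  # Euclidean norm, must be < 1
    return 2.0 * math.atanh(r)

def euclidean_mean(points):
    """Coordinate-wise average; an illustrative stand-in for a learned aggregate."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# Hypothetical candidate future embeddings: specific predictions sit near the
# edge of the ball (radius 0.9), spread over different directions.
candidates = [(0.9 * math.cos(a), 0.9 * math.sin(a)) for a in (0.0, 2.0, 4.0)]

# The uncertain prediction z_hat averages the candidates and falls near the origin.
z_hat = euclidean_mean(candidates)

print("candidate distances from origin:", [round(hyperbolic_norm(c), 3) for c in candidates])
print("z_hat distance from origin:", round(hyperbolic_norm(z_hat), 3))
```

Under these assumptions, each candidate lies far from the origin in hyperbolic distance while `z_hat` lies close to it, matching the caption's reading of distance from the origin as a proxy for how specific, and therefore how confident, the prediction is.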