About Time: Advances, Challenges, and Outlooks of Action Understanding

Alexandros Stergiou, Ronald Poppe

TL;DR

This survey addresses the broad problem of video action understanding by organizing tasks into three temporal scopes: recognition, prediction, and forecasting. It contrasts modeling approaches that separate visual and temporal information with those that jointly encode space-time, catalogs an extensive set of general and domain-specific datasets, and examines recognition, prediction, and forecasting tasks across multimodal settings. Key contributions include a comprehensive taxonomy, a synthesis of methodological trends (including vision-language models and self-supervised learning), and a forward-looking discussion of challenges such as efficiency, reasoning semantics, and robust cross-modal alignment. The work highlights how advances in multimodal and generative techniques can drive practical, real-time, and privacy-conscious action understanding in diverse applications. Overall, it provides a structured roadmap for researchers to navigate the rapidly evolving landscape and to develop unified, scalable, and semantically aware action understanding systems.

Abstract

We have witnessed impressive advances in video action understanding. Increased dataset sizes, variability, and computation availability have enabled leaps in performance and task diversification. Current systems can provide coarse- and fine-grained descriptions of video scenes, extract segments corresponding to queries, synthesize unobserved parts of videos, and predict context across multiple modalities. This survey comprehensively reviews advances in uni- and multi-modal action understanding across a range of tasks. We focus on prevalent challenges, overview widely adopted datasets, and survey seminal works with an emphasis on recent advances. We broadly distinguish between three temporal scopes: (1) recognition tasks of actions observed in full, (2) prediction tasks for ongoing partially observed actions, and (3) forecasting tasks for subsequent unobserved action(s). This division allows us to identify specific action modeling and video representation challenges. Finally, we outline future directions to address current shortcomings.

Paper Structure

This paper contains 72 sections, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Action understanding historical overview. We present popular tasks over time. Landmark papers are selected by their relevance to the period's trends. Most tasks remain popular today.
  • Figure 2: Action understanding tasks. The progress of the video is indicated by the top bar. Of the currently performed action, which has total duration $\tau_1$, only the part $\tau_{1,\rho}<\tau_1$ is readily observable. After a transition period $\tau_{1 \rightarrow 2} \geq 0$, another action is performed with duration $\tau_2$. Action recognition tasks consider full observations of the action at $\tau_1$. Action prediction uses only the part $\tau_{1,\rho}$ of the ongoing action. Action forecasting uses the current action at $\tau_1$ to predict future actions. Video example sourced from wang2019vatex.
  • Figure 3: Datasets compared by total dataset duration and primary modality. Circle sizes correspond to the (approximate) summed duration of all videos in the datasets. Recent datasets (i.e., $>80$) have longer total running times and include additional modalities such as language or audio.
  • Figure 4: Redundancy reduction methods include (a) selection of task-specific salient frames, (b) use of supplementary modalities such as audio to preview relevant regions to sample from, (c) input permutations to compress irrelevant frames and segments, and (d) using embeddings from a teacher model as targets.
  • Figure 5: Visualization of temporal-based tasks. (a) Temporal Action Localization (TAL) discovers the start and end times of individual actions. In contrast, (b) Spatio-Temporal Action Detection (STAD) is more complex as it requires temporally and spatially localizing actions with bounding boxes for actors and objects over time. Distinctively, (c) Video Repetition Counting (VRC) is not based on action labels and instead requires counting repetitions of actions or motions in an open-set setting. Video source from kay2017kinetics.
  • ...and 12 more figures
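
The three temporal scopes described in the abstract and in Figure 2 can be illustrated with a minimal sketch. It is not taken from the survey: the function name `observation_windows`, the `Window` type, and the choice to model the observed part as $\tau_{1,\rho} = \rho \cdot \tau_1$ with an observation ratio $0 < \rho < 1$ are assumptions made here for illustration only. Times are in seconds, measured from the start of the first action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """A time interval in seconds (hypothetical helper type)."""
    start: float
    end: float


def observation_windows(tau_1, rho, tau_transition, tau_2):
    """Return the input and target windows for each temporal scope.

    Assumption (not from the survey): the observed part of the ongoing
    action is modeled as rho * tau_1, with 0 < rho < 1.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    if tau_transition < 0.0:
        raise ValueError("the transition period must be non-negative")

    full_action = Window(0.0, tau_1)           # action observed in full
    partial_action = Window(0.0, rho * tau_1)  # partially observed action
    next_start = tau_1 + tau_transition        # next action begins after the transition
    next_action = Window(next_start, next_start + tau_2)

    return {
        # Recognition: the whole action is seen and labeled.
        "recognition": {"input": full_action, "target": full_action},
        # Prediction: only the observed part is seen; the label is for the ongoing action.
        "prediction": {"input": partial_action, "target": full_action},
        # Forecasting: the current action is seen; the label is for the next action.
        "forecasting": {"input": full_action, "target": next_action},
    }


if __name__ == "__main__":
    windows = observation_windows(tau_1=4.0, rho=0.25, tau_transition=1.0, tau_2=3.0)
    for task, spec in windows.items():
        print(task, spec["input"], "->", spec["target"])
```

In this sketch the three tasks differ only in which window is given as input and which window the label refers to, which mirrors the distinction the caption of Figure 2 draws.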