ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition

Mohammadreza Salehi, Jae Sung Park, Tanush Yadav, Aditya Kusupati, Ranjay Krishna, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi

TL;DR

The findings with state-of-the-art models indicate that having a high frame sampling rate is important for accurately recognizing actions in ActionAtlas, a feature that some leading proprietary video models, such as Gemini, do not include in their default configuration.

Abstract

Our world is full of varied actions and moves across specialized domains that we, as humans, strive to identify and understand. Within any single domain, actions can often appear quite similar, making it challenging for deep models to distinguish them accurately. To evaluate the effectiveness of multimodal foundation models in helping us recognize such actions, we present ActionAtlas v1.0, a multiple-choice video question answering benchmark featuring short videos across various sports. Each video in the dataset is paired with a question and four or five choices. The question pinpoints specific individuals, asking which choice "best" describes their action within a certain temporal context. Overall, the dataset includes 934 videos showcasing 580 unique actions across 56 sports, with a total of 1896 actions within choices. Unlike most existing video question answering benchmarks that only cover simplistic actions, often identifiable from a single frame, ActionAtlas focuses on intricate movements and rigorously tests the model's capability to discern subtle differences between moves that look similar within each domain. We evaluate open and proprietary foundation models on this benchmark, finding that the best model, GPT-4o, achieves a maximum accuracy of 45.52%. Meanwhile, non-expert crowd workers, provided with an action description for each choice, achieve 61.64% accuracy, where random chance is approximately 21%. Our findings with state-of-the-art models indicate that having a high frame sampling rate is important for accurately recognizing actions in ActionAtlas, a feature that some leading proprietary video models, such as Gemini, do not include in their default configuration.
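The random-chance figure follows from the mix of four-choice and five-choice questions. The sketch below illustrates that arithmetic only: the function names are hypothetical, uniform guessing is an assumption, and the abstract does not report how many questions have four versus five choices.

```python
def chance_accuracy(n_four: int, n_five: int) -> float:
    """Expected accuracy of uniform random guessing over a mix of
    four-choice and five-choice questions (hypothetical helper)."""
    total = n_four + n_five
    if total == 0:
        raise ValueError("need at least one question")
    # Each four-choice question is guessed right with probability 1/4,
    # each five-choice question with probability 1/5.
    return (n_four / 4 + n_five / 5) / total


def implied_four_choice_fraction(chance: float) -> float:
    """Solve chance = f/4 + (1 - f)/5 for f, the share of four-choice
    questions. Only meaningful for chance values between 0.20 and 0.25."""
    return (chance - 0.20) / (0.25 - 0.20)


if __name__ == "__main__":
    # The split is NOT given in the abstract; this value is only what the
    # rounded "approximately 21%" would imply under uniform guessing.
    print(implied_four_choice_fraction(0.21))
```

Under these assumptions, a chance level near 21% would correspond to roughly one question in five having four choices; because the reported figure is rounded, this is a rough estimate and not a dataset statistic.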

Paper Structure (53 sections, 16 figures, 8 tables)

Figures (16)

  • Figure 1: Examples from ActionAtlas. To answer ActionAtlas's questions, models must recognize the fine movements and nuances that differentiate actions belonging to the same domain (examples 2, 3, 4, 5, 6 from top) and correctly localize and track the individual performing the action when several people are present (example 1). [Video links from top to bottom: https://youtu.be/yCJEDQw7Jqw, https://youtu.be/q4F15uiDWUQ, https://youtu.be/DOWWKdDiTXo, https://youtu.be/WIB3Z-1Mv2o, https://youtu.be/kx6ybh4SxJU, https://youtu.be/EUXIUfSWpXw].
  • Figure 2: Data collection pipeline, consisting of automatic and manual parts. First, a comprehensive list of actions is compiled (§\ref{subsection_compile_action_list}); these actions are then used to search for metadata of relevant videos on YouTube (§\ref{subsection_meta}). A subset of videos is then selected with lexical search (§\ref{subsection_lexical_search}). If a video is shorter than 30 seconds, it is used in crowd-sourcing. Otherwise, the video is transcribed, and GPT4 selects potential 30-second segments that contain the actions based on the transcription (§\ref{subsection:whisper_gpt4_localization}). Mechanical Turk workers then verify the presence of actions in the segments (§\ref{subsection:manual_verification}) and localize them (§\ref{subsection:manual_localization}). If all videos of an action are rejected, we repeat the process to source new videos. Finally, GPT4 generates multiple-choice QAs, which are checked by the authors (§\ref{subsection:qa_generation_manual_refine}).
  • Figure 3: Given the transcription of a long video, GPT4-Text can be prompted to output timestamps where the action is likely to occur, without having access to frames. The model has found instances where the speaker comments on the great quality of the action (top) or indicates that a demonstration of the action is going to happen shortly (bottom). More details in §\ref{subsection:whisper_gpt4_localization}. [Video links: https://www.youtube.com/watch?v=XEAihB16L_g&t=59s, https://www.youtube.com/watch?v=9_xQXignjEQ&t=45s].
  • Figure 4: Definition of Half-Volley Drop Shot generated by GPT4-Text to be used by crowd-workers for validation and localization. The workers match the key elements listed in the definition with what they see in the video to identify if the action happens. For more details see §\ref{subsection:manual_verification}.
  • Figure 4: Improvements from chain-of-thought reasoning are not statistically significant on ActionAtlas. See §\ref{subsection:ablation} for more details on the setup.
  • ...and 11 more figures
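
The duration-based branch described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Video` class, the `route` function, and the stage labels are all hypothetical, and only the 30-second threshold comes from the caption.

```python
from dataclasses import dataclass

# Threshold taken from the Figure 2 caption; everything else is illustrative.
MAX_DIRECT_SECONDS = 30.0


@dataclass
class Video:
    video_id: str       # hypothetical identifier field
    duration_s: float   # video length in seconds


def route(video: Video) -> str:
    """Return the next pipeline stage for a video selected by lexical search.

    Videos shorter than 30 seconds go straight to crowd-sourcing; longer
    ones are first transcribed so that candidate segments can be proposed
    from the transcription before manual verification.
    """
    if video.duration_s < MAX_DIRECT_SECONDS:
        return "crowd_sourcing"
    return "transcribe_and_propose_segments"


if __name__ == "__main__":
    for v in (Video("short_clip", 12.0), Video("long_tutorial", 480.0)):
        print(v.video_id, "->", route(v))
```

A video of exactly 30 seconds falls into the transcription branch here, which matches the caption's wording ("shorter than 30 seconds ... Otherwise").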