The AVA-Kinetics Localized Human Actions Video Dataset

Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman

TL;DR

AVA-Kinetics combines AVA's dense, per-person action localization with Kinetics-700's diverse clip pool by annotating a single key-frame per Kinetics video with AVA-style boxes and labels. The dataset enables robust benchmarking of action localization models, demonstrated by improvements in Video Action Transformer performance when trained on the integrated data, both with ground-truth and detector-proposed boxes. Key analyses include NPMI-based cross-dataset class correlations, per-class gains, and data-size effects, underscoring the value of cross-dataset pretraining and multi-task potential. Overall, AVA-Kinetics broadens visual diversity while preserving precise localization, supporting more generalizable action recognition in videos.
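The NPMI analysis mentioned above measures how strongly a Kinetics clip-level class and an AVA action class co-occur. The summary does not give the paper's exact estimator, so the following is only a minimal sketch of the standard normalized pointwise mutual information, npmi(x, y) = ln(p(x, y) / (p(x) p(y))) / (-ln p(x, y)), computed from raw counts. The function name, its arguments, and the idea of counting over annotated key-frames are assumptions made for illustration.

```python
import math


def npmi(joint_count, count_x, count_y, total):
    """Standard normalized PMI in [-1, 1], estimated from counts.

    Hypothetical inputs (not taken from the paper):
      joint_count: key-frames that carry both class x and class y
      count_x:     key-frames that carry class x
      count_y:     key-frames that carry class y
      total:       all key-frames considered
    """
    if joint_count == 0:
        return -1.0  # conventional limit when the pair never co-occurs
    p_xy = joint_count / total
    p_x = count_x / total
    p_y = count_y / total
    if p_xy == 1.0:
        return 1.0  # both classes always present; limit of the 0/0 ratio
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
```

A value near 1 means the two classes almost always appear together, 0 means they co-occur as often as independence would predict, and -1 means they never co-occur.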

Abstract

This paper describes the AVA-Kinetics localized human actions video dataset. The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA-annotated Kinetics clips. The dataset contains over 230k clips, each annotated with the 80 AVA action classes for every human in its key-frame. We describe the annotation process and provide statistics about the new dataset. We also include a baseline evaluation using the Video Action Transformer Network on the AVA-Kinetics dataset, demonstrating improved performance for action classification on the AVA test set. The dataset can be downloaded from https://research.google.com/ava/

Paper Structure

This paper contains 17 sections, 1 equation, 8 figures, 5 tables.

Figures (8)

  • Figure 1: An example key-frame in the AVA-Kinetics dataset. The annotation contains AVA-style bounding boxes and their corresponding AVA labels. The key-frame is part of a Kinetics clip annotated with a clip-level label "high jump".
  • Figure 2: Number of samples per AVA class comparing the AVA train set (blue) and the Kinetics train set (green). The stacked bar shows the distribution of the AVA-Kinetics train set. The distribution is long-tailed. The tail part is zoomed in on the top right.
  • Figure 3: Number of unique videos per AVA class comparing the AVA train set (blue) and the Kinetics train set (green). The stacked bar shows the distribution of the AVA-Kinetics train set. The y-axis is in log-scale. The two datasets have different distributions in terms of videos. AVA consists of long videos while Kinetics consists of short clips, hence AVA has far fewer unique videos per class.
  • Figure 4: Number of bounding boxes per frame comparing the AVA train set (blue) and the Kinetics train set (green). The stacked bar shows the distribution of the AVA-Kinetics train set. There is a substantial number of frames with no person detected. The majority of the key-frames contain only one bounding box.
  • Figure 5: The distribution of the area of person bounding boxes in the AVA train set (blue) and the Kinetics train set (green). The area is normalized according to a 1x1 square. The peak area in AVA is around 0.02 while the peak in Kinetics is around 0.01. Kinetics produces significantly more small person bounding boxes.
  • ...and 3 more figures
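
The normalized area in Figure 5 can be illustrated with a small sketch. It assumes box corners are stored as fractions of the frame width and height, so that the whole frame is the 1x1 square; the function name and argument order are hypothetical and not taken from the paper.

```python
def normalized_box_area(x1, y1, x2, y2):
    """Area of a box whose corners are fractions of the frame size.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    each coordinate in [0, 1]. A full-frame box therefore has area 1.0.
    """
    width = max(0.0, x2 - x1)   # clamp so degenerate boxes give zero area
    height = max(0.0, y2 - y1)
    return width * height


# A box covering 10% of the frame width and 20% of its height.
example_area = normalized_box_area(0.4, 0.4, 0.5, 0.6)
```

Under this assumption, the example box has an area of about 0.02, the value the caption gives for the AVA peak, while a box half that size would sit at the Kinetics peak of about 0.01.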