OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos

Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Andrew Zisserman

TL;DR

This paper introduces OVR, a large open-vocabulary dataset for temporal repetition counting in videos, aggregating Ego4D and Kinetics to provide diverse exo- and ego-centric perspectives with free-form descriptions and precise repetition intervals. It proposes OVRCounter, a transformer-based counting model with a video resampler and AdaLN-conditioned counter that supports both class-agnostic and text-conditioned counting, trained via a density-based loss and a text-video contrastive loss. Empirical results show OVRCounter markedly improves counting accuracy and repetition localization over prior models on the OVR dataset, while preserving performance when conditioned on text and showing robustness to some text mismatches. The dataset and model enable scalable, open-vocabulary temporal reasoning in video, with potential impact on fields from sports analytics to robotics and health monitoring.

Abstract

We introduce a dataset of annotations of temporal repetitions in videos. The dataset, OVR (pronounced "over"), contains annotations for over 72K videos, with each annotation specifying the number of repetitions, the start and end time of the repetitions, and a free-form description of what is repeating. The annotations are provided for videos sourced from Kinetics and Ego4D, and consequently cover both Exo and Ego viewing conditions, with a wide variety of actions and activities. Moreover, OVR is almost an order of magnitude larger than previous datasets for video repetition. We also propose a baseline transformer-based counting model, OVRCounter, that can localise and count repetitions in videos that are up to 320 frames long. The model is trained and evaluated on the OVR dataset, and its performance is assessed with and without using text to specify the target class to count. The performance is also compared to a prior repetition counting model. The dataset is available for download at: https://sites.google.com/view/openvocabreps/

Paper Structure (12 sections, 1 equation, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Samples from the OVR dataset. Our data is annotated with open-vocabulary text descriptions, enabling text-conditioned repetition counting. The OVR dataset goes beyond sports and workout videos and introduces significant in-the-wild diversity, from both first-person (left) and third-person (right) views, at scale.
  • Figure 2: OVR-Ego4D. We display some example clips with free-form text descriptions in (a), the word-cloud visualisation of descriptions in (b), and repetition count and duration statistics in (c).
  • Figure 3: OVR-Kinetics. We display some example clips with free-form text descriptions in (a), the word-cloud visualisation of descriptions in (b), and repetition count and duration statistics in (c).
  • Figure 4: Dataset Curation. We explain how we construct the OVR dataset in four stages.
  • Figure 5: OVRCounter Architecture. An input video goes through a spatio-temporal video encoder (bottom left) and is then resampled by the Video Resampler module to generate per-frame video tokens. These per-frame video tokens run through the Conditional Counter to generate per-frame densities, which are used to compute the final count. The Conditional Counter is conditioned either on a CLS token, enabling class-agnostic counting, or on a text token, enabling text-conditioned counting.
  • ...and 3 more figures
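
The Figure 5 caption says that per-frame densities are used to compute the final count, but this listing does not say how. The sketch below is a minimal illustration under two assumptions that are not taken from the paper: each annotated repetition is spread uniformly over its frames so that it contributes a total mass of 1, and the count is the sum of the per-frame densities. The function names and the (start_frame, end_frame) interval format are hypothetical.

```python
from typing import List, Tuple


def target_density(num_frames: int, repetitions: List[Tuple[int, int]]) -> List[float]:
    """Build a per-frame density in which each repetition sums to 1.

    `repetitions` holds hypothetical (start_frame, end_frame) pairs with an
    exclusive end. This uniform spreading is an assumption, not the paper's
    stated recipe.
    """
    density = [0.0] * num_frames
    for start, end in repetitions:
        if not 0 <= start < end <= num_frames:
            raise ValueError(f"invalid repetition interval: ({start}, {end})")
        weight = 1.0 / (end - start)  # each repetition carries a total mass of 1
        for frame in range(start, end):
            density[frame] += weight
    return density


def count_from_density(density: List[float]) -> float:
    """Assumed read-out: the count is the sum of the per-frame densities."""
    return sum(density)


# Two repetitions (3 and 5 frames long) inside a 12-frame clip.
example = target_density(12, [(2, 5), (5, 10)])
example_count = count_from_density(example)
```

Under these assumptions, frames outside any repetition keep a density of zero, so the same vector also indicates where the repetitions occur in time, and rounding the sum gives an integer count.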