OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Andrew Zisserman
TL;DR
This paper introduces OVR, an open-vocabulary dataset for temporal repetition counting with annotations for over 72K videos. The videos are sourced from Kinetics and Ego4D, so the dataset covers both exocentric and egocentric viewpoints, and each annotation gives a free-form description of the repeating action together with the repetition count and its start and end times. The paper also proposes OVRCounter, a transformer-based counting model with a video resampler and an AdaLN-conditioned counter that supports both class-agnostic and text-conditioned counting; it is trained with a density-based loss and a text-video contrastive loss. Empirical results on OVR show OVRCounter markedly improving counting accuracy and repetition localization over prior models, while preserving performance when conditioned on text and remaining robust to some text mismatches. The dataset and model enable scalable, open-vocabulary temporal reasoning in video, with potential applications ranging from sports analytics to robotics and health monitoring.
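To make the idea of a density-based counting objective concrete, here is a minimal sketch in plain Python. It is not OVRCounter's actual loss: the summary only states that the loss is density-based, so the uniform target density, the mean-squared-error form, and all function names below are assumptions chosen for illustration. The one property the sketch relies on is that summing a per-frame density yields a count.

```python
from typing import List


def uniform_density(num_frames: int, start: int, end: int, count: int) -> List[float]:
    """Build a per-frame target density that sums to `count`.

    Assumption: repetitions are spread evenly over frames [start, end).
    The real target construction may differ.
    """
    if not 0 <= start < end <= num_frames:
        raise ValueError("need 0 <= start < end <= num_frames")
    per_frame = count / (end - start)
    return [per_frame if start <= t < end else 0.0 for t in range(num_frames)]


def predicted_count(density: List[float]) -> float:
    """The count is read off as the sum of the per-frame density."""
    return sum(density)


def density_mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two densities (an assumed loss form)."""
    if len(pred) != len(target):
        raise ValueError("densities must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Hypothetical example: 3 repetitions between frames 2 and 6 of a 10-frame clip.
target = uniform_density(num_frames=10, start=2, end=6, count=3)
flat_prediction = [0.0] * 10  # a model that predicts "nothing repeats"
loss = density_mse(flat_prediction, target)
```

A density target of this kind carries both pieces of supervision at once: its sum gives the count, and its non-zero support marks where the repetitions occur.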
Abstract
We introduce a dataset of annotations of temporal repetitions in videos. The dataset, OVR (pronounced "over"), contains annotations for over 72K videos, with each annotation specifying the number of repetitions, the start and end times of the repetitions, and a free-form description of what is repeating. The annotations are provided for videos sourced from Kinetics and Ego4D, and consequently cover both Exo and Ego viewing conditions, with a huge variety of actions and activities. Moreover, OVR is almost an order of magnitude larger than previous datasets for video repetition. We also propose a baseline transformer-based counting model, OVRCounter, that can localise and count repetitions in videos that are up to 320 frames long. The model is trained and evaluated on the OVR dataset, and its performance is assessed with and without using text to specify the target class to count. The performance is also compared to that of a prior repetition counting model. The dataset is available for download at: https://sites.google.com/view/openvocabreps/
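As an illustration of what one annotation carries, the following sketch models a single record with the three pieces of information the abstract lists: the repetition count, the start and end times, and the free-form description. The class name, field names, units (seconds), and sample values are all hypothetical; the released files may use a different schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepetitionAnnotation:
    """One repetition annotation (hypothetical schema, not the official one)."""

    video_id: str      # assumed identifier of the source video
    count: int         # number of repetitions
    start_sec: float   # assumed unit: seconds from the start of the video
    end_sec: float
    description: str   # free-form text describing what is repeating

    def __post_init__(self) -> None:
        # Basic sanity checks on the annotated segment.
        if self.count < 0:
            raise ValueError("count must be non-negative")
        if self.end_sec <= self.start_sec:
            raise ValueError("end_sec must be greater than start_sec")

    @property
    def duration_sec(self) -> float:
        """Length of the annotated repetition segment."""
        return self.end_sec - self.start_sec

    @property
    def mean_period_sec(self) -> float:
        """Average time per repetition within the segment."""
        if self.count == 0:
            raise ValueError("period is undefined when count is zero")
        return self.duration_sec / self.count


# Made-up example values, for illustration only.
example = RepetitionAnnotation(
    video_id="example_video",
    count=4,
    start_sec=2.0,
    end_sec=10.0,
    description="person doing push-ups",
)
```

Because the description is free-form text rather than a label from a fixed class list, a record like this can supervise both class-agnostic counting (ignoring `description`) and text-conditioned counting (passing it to the model).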
