AirLetters: An Open Video Dataset of Characters Drawn in the Air

Rishit Dagli, Guillaume Berger, Joanna Materzynska, Ingo Bax, Roland Memisevic

TL;DR

AirLetters, a new video dataset consisting of real-world videos of human-generated, articulated motions, requires a vision model to predict letters that humans draw in the air, and shows that accurately representing complex articulated motions remains an open problem for end-to-end learning.

Abstract

We introduce AirLetters, a new video dataset consisting of real-world videos of human-generated, articulated motions. Specifically, our dataset requires a vision model to predict letters that humans draw in the air. Unlike existing video datasets, accurate classification predictions for AirLetters rely critically on discerning motion patterns and on integrating long-range information in the video over time. An extensive evaluation of state-of-the-art image and video understanding models on AirLetters shows that these methods perform poorly and fall far behind a human baseline. Our work shows that, despite recent progress in end-to-end video understanding, accurately representing complex articulated motions -- a task that is trivial for humans -- remains an open problem for end-to-end learning.

Paper Structure (27 sections, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Overview. We present AirLetters, a novel dataset comprising video-label pairs of human hands drawing characters in the air. The dataset covers all the Latin letters and digits as well as two background classes, "Doing Other Things" and "Doing Nothing", and contains 161,652 videos recorded by 1,781 workers. We show the trajectory of the fingertips for visualization purposes.
  • Figure 2: Example Videos. Frames from randomly sampled videos from our dataset showing humans drawing characters as well as contrast classes.
  • Figure 3: Challenges due to inter-class similarities and intra-class diversity. We show some examples of drawing the letter "B" and the digit "3". Differentiating these two classes requires understanding the depth and velocity of relative motion to determine whether the individual intended to draw a vertical line (for "B") or only meant to place their hands in position (for "3"). Underneath, we show examples of variability in drawing the letter "Y". For example, in one way of drawing the letter "Y", only the last few frames show a stroke that distinguishes it from the letter "X".
  • Figure 4: Diversity in our Dataset. Each of the images is taken from a randomly sampled video from our dataset. Our dataset has a large variance in the appearance of subjects, background, occlusion, and lighting conditions in the videos.
  • Figure 5: Scaling Training Frames. Performance of models across different numbers of training frames. The Pareto frontier is represented by a black curve. Note that this dataset requires models to attend to the entire video to perform well, and increasing the number of frames that models attend to significantly increases their performance.
  • ...and 5 more figures
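
The captions for Figures 1 and 5 suggest two simple pieces of bookkeeping: the label set (Latin letters, digits, and two background classes) and a way to choose frames so that the whole video is covered. The sketch below illustrates both under stated assumptions. The label strings, the name `uniform_frame_indices`, and the segment-centre sampling rule are hypothetical and are not taken from the paper, which may sample frames differently.

```python
import string

# Label set as described in the Figure 1 caption. The exact label strings
# (uppercase letters, digit characters) are an assumption for illustration.
LABELS = (
    list(string.ascii_uppercase)      # 26 Latin letters
    + list(string.digits)             # 10 digits
    + ["Doing Other Things", "Doing Nothing"]  # 2 background classes
)


def uniform_frame_indices(num_video_frames: int, num_samples: int) -> list[int]:
    """Return `num_samples` frame indices spread evenly over the whole clip.

    Hypothetical helper: splits the clip into equal temporal segments and
    takes the frame at the centre of each, so late strokes are not skipped.
    """
    if num_video_frames <= 0 or num_samples <= 0:
        raise ValueError("both arguments must be positive")
    segment = num_video_frames / num_samples
    return [
        min(int((i + 0.5) * segment), num_video_frames - 1)
        for i in range(num_samples)
    ]
```

Sampling across the full clip matters here because, as the Figure 3 caption notes, the stroke that separates "Y" from "X" may appear only in the last few frames.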