
CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning

Rohit Girdhar, Deva Ramanan

TL;DR

CATER addresses the gap in video understanding benchmarks by introducing a synthetic, CLEVR-inspired dataset with fully controllable biases and occlusions that force long-range spatiotemporal reasoning. It defines three progressively harder tasks—atomic action recognition, compositional action recognition, and snitch localization—rooted in Allen's interval algebra to probe temporal reasoning. Through experiments with state-of-the-art video models and LSTM-based aggregation, the authors show that current architectures struggle on CATER, particularly for localization under occlusion and containment, and that temporal modeling substantially improves performance. The dataset, along with diagnostic tools and metadata, provides a rigorous platform to study and drive advances in long-term video understanding beyond conventional benchmarks.

Abstract

Computer vision has undergone a dramatic revolution in performance, driven in large part through deep features trained on large-scale supervised datasets. However, many of these improvements have focused on static image analysis; video understanding has seen rather modest improvements. Even though new datasets and spatiotemporal models have been proposed, simple frame-by-frame classification methods often still remain competitive. We posit that current video datasets are plagued with implicit biases over scene and object structure that can dwarf variations in temporal structure. In this work, we build a video dataset with fully observable and controllable object and scene bias, and which truly requires spatiotemporal understanding in order to be solved. Our dataset, named CATER, is rendered synthetically using a library of standard 3D objects, and tests the ability to recognize compositions of object movements that require long-term reasoning. In addition to being a challenging dataset, CATER also provides a plethora of diagnostic tools to analyze modern spatiotemporal video architectures by being completely observable and controllable. Using CATER, we provide insights into some of the most recent state-of-the-art deep video architectures.

Paper Structure

This paper contains 9 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Real world video understanding. Consider this iconic movie scene from The Godfather in (a), where the protagonist leaves the table, goes to the bathroom to extract a hidden firearm, and returns to the table presumably with the intention of shooting a person. While the gun itself is visible in only a few frames of the whole clip, it is trivial for us to realize that the protagonist has it in the last frame. An even simpler instantiation of such a reasoning task could be the cup-and-ball shell game in (b), where the task is to determine which of the cups contains the ball at the end of the trick. Can we design similarly hard tasks for computers?
  • Figure 2: CATER dataset and tasks. Sampled frames from a random video from CATER. We show some of the actions afforded by objects in this video, as labeled on the top using arrows. We define three tasks on these videos. Task 1 requires identifying all active actions in the video. Task 2 requires identifying all active compositional actions. Task 3 requires quantized spatial localization of the snitch object at the end of the video. Note that, as in this case, the snitch may be occluded or 'contained' by another object, and hence models would require spatiotemporal understanding to complete the task. Please refer to the supplementary video for more example videos.
  • Figure 3: Allen's temporal algebra. Exhaustive list of temporal relations between intervals, as defined by Allen's algebra [allen1983maintaining]. For simplicity, we group them into three broad relations to define classes for composite actions, although in principle we could use all thirteen. Figure courtesy of [allen_figure].
  • Figure 4: Long term reasoning. Comparing the best reported performance of standard models on existing datasets and CATER (task 3). Unlike previous benchmarks, (1) temporal modeling using LSTM helps and (2) local temporal cues (flow) are not effective by themselves on CATER. 2S here refers to 'Two Stream'. TSN performance from [tsn_kinetics, tsn_full_perf].
  • Figure 5: Diagnostic analysis of localization performance. We bin the test set using certain parameters. For each, we show the test set distribution with the bar graph, the performance over that bin using the line plot, and performance of that model on the full val set with the dotted line. We find that localization performance (a) drops significantly if the snitch is kept moving till the end. This is possibly because, when the snitch only moves in the beginning and is static afterwards, the models have a lot more evidence from which to predict the correct location. Interestingly, the tracker is much less affected by this, as it tracks the snitch until the very end; (b) drops if the snitch is 'contain'-ed by another object in the end, and the tracker is the worst affected by it; (c) drops initially with increasing displacement of the snitch from its start position, but is stable after that; and (d) is relatively stable across different numbers of objects in the scene.
  • ...and 2 more figures
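
The Figure 3 caption mentions thirteen interval relations from Allen's algebra, grouped into three broad relations for composite actions. The following is a minimal sketch of that idea, not the authors' implementation. The function names `allen_relation` and `coarse_relation`, the integer frame intervals, and the particular grouping into "before", "during", and "after" (with "meets" folded into "before" and "met_by" into "after") are assumptions made for illustration; the text above does not specify them.

```python
from typing import Tuple

# An interval is (start_frame, end_frame) with start_frame < end_frame.
Interval = Tuple[int, int]


def allen_relation(a: Interval, b: Interval) -> str:
    """Return which of Allen's thirteen relations holds between a and b."""
    a0, a1 = a
    b0, b1 = b
    if a1 < b0:
        return "before"
    if b1 < a0:
        return "after"
    if a1 == b0:
        return "meets"
    if b1 == a0:
        return "met_by"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started_by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished_by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    # Only partial overlap with distinct endpoints remains.
    return "overlaps" if a0 < b0 else "overlapped_by"


def coarse_relation(a: Interval, b: Interval) -> str:
    """Collapse the thirteen relations into three broad ones.

    Assumed grouping (hypothetical): no shared time -> before/after,
    any shared time -> during.
    """
    fine = allen_relation(a, b)
    if fine in ("before", "meets"):
        return "before"
    if fine in ("after", "met_by"):
        return "after"
    return "during"


if __name__ == "__main__":
    slide = (0, 30)    # hypothetical action intervals, in frames
    rotate = (20, 50)
    print(allen_relation(slide, rotate), coarse_relation(slide, rotate))
```

With a grouping like this, a composite action label could be formed from an ordered pair of atomic actions plus the coarse relation between their intervals.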