STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
Palaash Agrawal, Haidi Azaman, Cheston Tan
TL;DR
STUPD introduces a large-scale synthetic dataset that jointly targets static and dynamic spatial prepositions and temporal relations to advance visual relationship reasoning. Built in Unity3D with 3D object metadata, STUPD provides 150K Spatial-STUPD samples across 30 senses and 50K Temporal-STUPD samples across 10 senses, intended for pretraining models that are then applied to real-world tasks. Empirical results show that pretraining on STUPD improves performance on real-world datasets (e.g., SpatialSense and ImageNet-VidVRD), with dynamic and spatio-temporal cues transferring particularly well, and they highlight the value of 3D information and explicit temporal reasoning. The work discusses limitations such as sense ambiguity and limited object diversity, and outlines future directions: broader sense disambiguation, more object types, and deeper 3D reasoning, all aimed at narrowing the gap between synthetic and real-world visual understanding.
Abstract
Understanding relations between objects is crucial for understanding the semantics of a visual scene. It is also an essential step toward bridging visual and language models. However, current state-of-the-art computer vision models still lack the ability to perform spatial reasoning well. Existing datasets mostly cover a relatively small number of spatial relations, all of which are static relations that do not intrinsically involve motion. In this paper, we propose the Spatial and Temporal Understanding of Prepositions Dataset (STUPD), a large-scale video dataset for understanding static and dynamic spatial relationships derived from prepositions of the English language. The dataset contains 150K visual depictions (videos and images) covering 30 distinct spatial prepositional senses, in the form of object interaction simulations generated synthetically using Unity3D. In addition to spatial relations, we also propose 50K visual depictions across 10 temporal relations, consisting of videos depicting event/time-point interactions. To our knowledge, no existing dataset represents temporal relations through visual settings. We also provide 3D information about object interactions, such as frame-wise coordinates, along with descriptions of the objects used. The goal of this synthetic dataset is to help models perform better at visual relationship detection in real-world settings. We demonstrate that various models perform better on two real-world datasets (ImageNet-VidVRD and SpatialSense) when pretrained on STUPD than when pretrained on other datasets.
