STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning

Palaash Agrawal, Haidi Azaman, Cheston Tan

TL;DR

STUPD introduces a large-scale synthetic dataset that jointly targets static/dynamic spatial prepositions and temporal relations to advance visual relationship reasoning. Built in Unity3D with 3D object metadata, STUPD provides 150K Spatial-STUPD samples across 30 senses and 50K Temporal-STUPD samples across 10 senses, enabling robust pretraining for real-world tasks. Empirical results show that pretraining on STUPD improves performance on real-world datasets (e.g., SpatialSense and ImageNet-VidVRD), with dynamic/spatio-temporal cues offering particularly strong transfer, while highlighting the value of 3D information and explicit temporal reasoning. The work discusses limitations like sense ambiguity and limited object diversity, and outlines future directions to broaden sense disambiguation, object types, and deeper 3D reasoning to further close the gap between synthetic and real-world visual understanding.

Abstract

Understanding relations between objects is crucial for understanding the semantics of a visual scene. It is also an essential step toward bridging vision and language models. However, current state-of-the-art computer vision models still perform poorly at spatial reasoning. Existing datasets mostly cover a relatively small number of spatial relations, all of which are static relations that do not intrinsically involve motion. In this paper, we propose the Spatial and Temporal Understanding of Prepositions Dataset (STUPD), a large-scale video dataset for understanding static and dynamic spatial relationships derived from prepositions of the English language. The dataset contains 150K visual depictions (videos and images) covering 30 distinct spatial prepositional senses, in the form of object interaction simulations generated synthetically using Unity3D. In addition to spatial relations, we also propose 50K visual depictions across 10 temporal relations, consisting of videos depicting event/time-point interactions. To our knowledge, no existing dataset represents temporal relations through visual settings. We also provide 3D information about object interactions, such as frame-wise coordinates, along with descriptions of the objects used. The goal of this synthetic dataset is to help models perform better at visual relationship detection in real-world settings. We demonstrate that pretraining on STUPD improves the performance of various models on two real-world datasets (ImageNet-VidVRD and SpatialSense) compared with other pretraining datasets.

Paper Structure (50 sections, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Some examples of Spatial-STUPD, which contains 30 spatial relations. These relations can be divided into two categories: static (involving no motion) and dynamic (involving relative motion between the subject and object).
  • Figure 2: We propose 10 temporal relations representing interactions between different events or time points within a specified temporal window of $W$ frames. Different temporal prepositions are used in specific contexts in English. For each relation, A, B, and/or C can be an event (E), a time point (T), or either an event or a time point (E/T). Each temporal relation can have multiple types of event/time point interactions. The translucent shade of certain events in the figure represents the possible variation in the point of occurrence.
  • Figure 3: Dataset statistics. (a) The occurrence of prefab categories is roughly consistent throughout the dataset. (b) The blue line represents the minimum number of occurrences of a temporal relation. A single temporal interaction can have multiple temporal relation predicates associated with it.
  • Figure 4: Overview of 3D prefabs used. We curate a total of 183 prefabs across 45 categories and 8 supercategories. We also try to have a balanced set of person prefabs to address certain ethical concerns (bottom right). (Also see Figure 5.)
  • Figure 5: Ethnic distribution statistics for the person prefab.
  • ...and 3 more figures
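
The static/dynamic split described in the Figure 1 caption can be illustrated with frame-wise 3D coordinates of the kind the abstract mentions. The following is a minimal sketch only, not the authors' method: it assumes each object's track is available as a list of (x, y, z) positions, one per frame, and the function name, track format, and tolerance are hypothetical, since this page does not specify STUPD's metadata layout.

```python
from typing import Sequence, Tuple

# Hypothetical track format: one (x, y, z) position per frame.
Point3D = Tuple[float, float, float]


def has_relative_motion(
    subject_track: Sequence[Point3D],
    object_track: Sequence[Point3D],
    tol: float = 1e-6,
) -> bool:
    """Return True if the subject-to-object offset changes across frames.

    Under the assumption of this sketch, a constant offset corresponds to a
    static relation and a changing offset to a dynamic relation.
    """
    if len(subject_track) != len(object_track):
        raise ValueError("tracks must cover the same number of frames")
    if not subject_track:
        return False

    # Offset of the subject relative to the object in every frame.
    offsets = [
        tuple(s - o for s, o in zip(s_pos, o_pos))
        for s_pos, o_pos in zip(subject_track, object_track)
    ]
    first = offsets[0]

    # Dynamic if any later offset differs from the first by more than tol
    # along any axis.
    return any(
        max(abs(a - b) for a, b in zip(offset, first)) > tol
        for offset in offsets[1:]
    )


# Both objects translate together: the offset never changes.
moving_together = has_relative_motion(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
)

# The subject approaches a fixed object: the offset changes.
approaching = has_relative_motion(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(2.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
)
```

Comparing offsets rather than absolute positions matters here: two objects that move in lockstep have no relative motion, so this sketch treats them as static even though both are moving.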