SynthVerse: A Large-Scale Diverse Synthetic Dataset for Point Tracking

Weiguang Zhao, Haoran Xu, Xingyu Miao, Qin Zhao, Rui Zhang, Kaizhu Huang, Ning Gao, Peizhou Cao, Mingze Sun, Mulin Yu, Tao Lu, Linning Xu, Junting Dong, Jiangmiao Pang

TL;DR

SynthVerse tackles the data bottleneck in general point tracking by introducing a large-scale synthetic dataset generated via a cross-platform Blender+Isaac Sim pipeline. It offers broad domain coverage, including articulated and deformable objects, humans/animals, and embodied/humanoid interactions, with dense 3D trajectories and visibility annotations, plus a multi-domain benchmark spanning Nav, Film, Embodied, and other domains. Empirical results show that fine-tuning state-of-the-art trackers (e.g., TAPIP3D) on SynthVerse improves 3D/2D tracking performance and generalization across synthetic and real-world datasets, while also exposing limitations under domain shifts. The work demonstrates the value of synthetic diversity for robust point tracking and lays groundwork for broader sim-to-real transfer and future model benchmarking.

Abstract

Point tracking aims to follow visual points through complex motion, occlusion, and viewpoint changes, and has advanced rapidly with modern foundation models. Yet progress toward general point tracking remains constrained by limited high-quality data, as existing datasets often provide insufficient diversity and imperfect trajectory annotations. To this end, we introduce SynthVerse, a large-scale, diverse synthetic dataset specifically designed for point tracking. SynthVerse includes several new domains and object types missing from existing synthetic datasets, such as animated-film-style content, embodied manipulation, scene navigation, and articulated objects. SynthVerse substantially expands dataset diversity by covering a broader range of object categories and providing high-quality dynamic motions and interactions, enabling more robust training and evaluation for general point tracking. In addition, we establish a highly diverse point tracking benchmark to systematically evaluate state-of-the-art methods under broader domain shifts. Extensive experiments and analyses demonstrate that training with SynthVerse yields consistent improvements in generalization and reveal limitations of existing trackers under diverse settings.

Paper Structure (14 sections, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Our SynthVerse Dataset
  • Figure 2: Data Generation Pipeline. Shot Production denotes publicly released shot-level production project files from selected clips of animated films. Technical Components refer to a collection of technical methods and tool modules used during scene construction to facilitate scene layout, motion setup, and related production controls.
  • Figure 3: Data Augmentation. The first row shows randomized texture replacements for hands and tabletops. The second row illustrates randomization of flower-petal materials (e.g., base color/albedo). The third row demonstrates camera and environment augmentation, including randomizing camera FOV and HDR environment maps.
  • Figure 4: SynthVerse Benchmark Radar. For simplicity, we denote SpatialTrackerV2-online as SpatialTrackerV2 and refer to TAPIP3D-world as TAPIP3D.
  • Figure 5: Qualitative Comparison on SynthVerse Benchmark. For simplicity, we denote TAPIP3D-world as TAPIP3D and refer to TAPIP3D-world$^{*}$ as TAPIP3D$^{*}$. ($^*$) denotes results from the fine-tuned model.