SynTable: A Synthetic Data Generation Pipeline for Unseen Object Amodal Instance Segmentation of Cluttered Tabletop Scenes

Zhili Ng, Haozhe Wang, Zhengshen Zhang, Francis Tay Eng Hock, Marcelo H. Ang

TL;DR

SynTable introduces a photorealistic, end-to-end synthetic data generation pipeline built on NVIDIA Isaac Sim to address the lack of labeled data for unseen object amodal instance segmentation (UOAIS) and the Sim-to-Real gap in cluttered tabletop scenes. By automatically producing rich ground truth, including modal and amodal masks, occlusion data, depth, and occlusion order graphs, and by synthesizing a large-scale dataset (SynTable-Sim) with 1075 novel objects, the approach enables effective training of state-of-the-art UOAIS models. The work also defines the Occlusion Order Accuracy ($ACC_{OO}$) metric and the associated OOAM/OODG representations to quantify occlusion reasoning, and demonstrates substantial improvements in real-world transfer on OSD-Amodal across multiple architectures. The authors provide open-source tooling and datasets to facilitate replication and broader adoption in robotics and AR applications where occlusion-aware perception is essential.

Abstract

In this work, we present SynTable, a unified and flexible Python-based dataset generator built using NVIDIA's Isaac Sim Replicator Composer for generating high-quality synthetic datasets for unseen object amodal instance segmentation of cluttered tabletop scenes. Our dataset generation tool can render complex 3D scenes containing object meshes, materials, textures, lighting, and backgrounds. Metadata, such as modal and amodal instance segmentation masks, object amodal RGBA instances, occlusion masks, depth maps, bounding boxes, and material properties, can be automatically generated to annotate the scene according to the users' requirements. Our tool eliminates the need for manual labeling in the dataset generation process while ensuring the quality and accuracy of the dataset. We discuss our design goals, framework architecture, and the performance of our tool. We demonstrate the use of a sample dataset generated using SynTable for training a state-of-the-art model, UOAIS-Net. Our results show significantly improved performance in Sim-to-Real transfer when evaluated on the OSD-Amodal dataset. We offer this tool as an open-source, easy-to-use, photorealistic dataset generator for advancing research in deep learning and synthetic data generation. The links to our source code, demonstration video, and sample dataset can be found in the supplementary materials.
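
To make the relationship between the mask types listed in the abstract concrete, the following is a minimal sketch, not SynTable's own code. It assumes that masks are boolean NumPy arrays, that an occlusion mask is the part of the amodal mask that is not visible, and that the occlusion rate is the occluded area divided by the amodal area. The function names `occlusion_mask` and `occlusion_rate` are hypothetical.

```python
import numpy as np


def occlusion_mask(amodal: np.ndarray, visible: np.ndarray) -> np.ndarray:
    """Pixels inside the object's full (amodal) extent that are hidden from view."""
    return np.logical_and(amodal, np.logical_not(visible))


def occlusion_rate(amodal: np.ndarray, visible: np.ndarray) -> float:
    """Assumed definition: occluded pixel count divided by amodal pixel count."""
    amodal_area = int(amodal.sum())
    if amodal_area == 0:
        # An object with no pixels in the frame has no meaningful occlusion rate.
        return 0.0
    return float(occlusion_mask(amodal, visible).sum()) / amodal_area


# Toy 4x4 example: the object spans two full rows (8 pixels),
# and its right half is hidden behind another object.
amodal = np.zeros((4, 4), dtype=bool)
amodal[1:3, :] = True
visible = amodal.copy()
visible[:, 2:] = False

occluded = occlusion_mask(amodal, visible)
rate = occlusion_rate(amodal, visible)
```

In this toy case, 4 of the 8 amodal pixels are hidden, so `rate` evaluates to 0.5.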

Paper Structure

This paper contains 32 sections, 14 equations, 11 figures, 7 tables, and 2 algorithms.

Figures (11)

  • Figure 1: (a) RGB outputs of photorealistic cluttered tabletop scenes generated by the SynTable pipeline. (b) Visualization of RGB Images, Depth Images, Object Amodal Masks, Object Visible Masks, Object Occlusion Masks, and Object Visible Bounding Boxes.
  • Figure 2: High-level overview of the synthetic data generation pipeline.
  • Figure 3: The process of capturing annotations for a scene. For each viewpoint, the pipeline captures (a) RGB and depth images with all objects, (b) object visible masks and bounding boxes, (c) object amodal masks (including object amodal RGBA instances), (d) object occlusion masks and occlusion rates, and (e) the occlusion order adjacency matrix.
  • Figure 4: Initialization of objects with randomized coordinates and rotations. The initial position of each object in the scene is randomized but constrained to lie within the 3D orange box. The orange box is 0.2 m above the tabletop. The roll, pitch, and yaw of each object are also randomly sampled within the range of $0^\circ$ to $360^\circ$.
  • Figure 5: Sampling of camera viewpoints within concentric hemispheres (shown in blue). The origins of the two concentric hemispheres are centered at the tabletop surface’s center coordinate with an offset of 0.2 m in the positive $z$ direction in the world frame. This ensures that every camera viewpoint has, at minimum, a direct line of sight to the tabletop surface and captures part of the tabletop plane. This figure is best viewed zoomed in.
  • ...and 6 more figures
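
The randomization described in the captions of Figures 4 and 5 can be sketched as follows. This is an illustrative approximation, not the SynTable implementation: the captions give only the 0.2 m offsets and the $0^\circ$ to $360^\circ$ rotation range, so the uniform distributions, the box extents, the hemisphere radii, and the names `sample_object_pose`, `sample_camera_position`, `TABLE_CENTER`, `BOX_MIN`, `BOX_MAX`, `R_MIN`, and `R_MAX` are all assumptions made for this example. The sketch also reads "0.2 m above the tabletop" as the height of the box's lower face.

```python
import numpy as np

# Hypothetical scene constants (only the 0.2 m offsets come from the captions).
TABLE_CENTER = np.array([0.0, 0.0, 0.75])                 # assumed tabletop centre, metres
BOX_MIN = TABLE_CENTER + np.array([-0.3, -0.3, 0.2])      # lower corner of the spawn box
BOX_MAX = TABLE_CENTER + np.array([0.3, 0.3, 0.5])        # upper corner of the spawn box
HEMISPHERE_ORIGIN = TABLE_CENTER + np.array([0.0, 0.0, 0.2])
R_MIN, R_MAX = 0.6, 1.2                                   # assumed inner and outer radii


def sample_object_pose(rng, box_min, box_max):
    """Draw a position uniformly inside the box and roll, pitch, yaw in [0, 360) degrees."""
    position = rng.uniform(box_min, box_max)
    rpy_deg = rng.uniform(0.0, 360.0, size=3)
    return position, rpy_deg


def sample_camera_position(rng, origin, r_min, r_max):
    """Draw a point in the shell between two concentric upper hemispheres."""
    # Sampling r^3 uniformly makes the point uniform in volume within the shell.
    radius = np.cbrt(rng.uniform(r_min**3, r_max**3))
    azimuth = rng.uniform(0.0, 2.0 * np.pi)
    # cos(polar angle) uniform in [0, 1] gives a uniform direction on the upper hemisphere.
    cos_polar = rng.uniform(0.0, 1.0)
    sin_polar = np.sqrt(1.0 - cos_polar**2)
    direction = np.array(
        [sin_polar * np.cos(azimuth), sin_polar * np.sin(azimuth), cos_polar]
    )
    return origin + radius * direction


rng = np.random.default_rng(seed=0)
object_position, object_rpy_deg = sample_object_pose(rng, BOX_MIN, BOX_MAX)
camera_position = sample_camera_position(rng, HEMISPHERE_ORIGIN, R_MIN, R_MAX)
```

Because the polar angle is restricted to the upper hemisphere, every sampled camera position lies at or above the hemisphere origin, which is consistent with the caption's requirement that the camera can see part of the tabletop plane.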