TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image

Ziqian Wang, Yonghao He, Licheng Yang, Wei Zou, Hongxuan Ma, Liu Liu, Wei Sui, Yuxin Guo, Hu Su

Abstract

Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI, especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods mainly focus on large-scale scenes and struggle to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference image to obtain per-instance images. Each instance is reconstructed into a 3D model, followed by canonical coordinate alignment. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from a 2D reference. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, and can generate realistic tabletop scenes with rich stylistic and spatial diversity. Our code will be publicly available.
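
As a rough illustration of the final assembly idea (apply each instance's estimated scale, rotation, and translation, then check that the result is collision-free), the sketch below works on 2D top-view footprints only. It is not the authors' implementation: the abstract does not say how collisions are detected, so the axis-aligned bounding-box test is an assumption, and the names `Placement`, `place_footprint`, `overlaps`, and `colliding_pairs` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Placement:
    """Hypothetical per-instance pose on the table plane (assumption, not from the paper)."""
    scale: float  # uniform scale factor
    yaw: float    # rotation about the vertical axis, in radians
    tx: float     # translation along x on the tabletop
    ty: float     # translation along y on the tabletop


def place_footprint(half_w, half_d, p):
    """Scale, rotate, then translate a rectangle centred at the origin.

    Returns the axis-aligned bounds (xmin, ymin, xmax, ymax) of the result.
    """
    corners = [(-half_w, -half_d), (half_w, -half_d),
               (half_w, half_d), (-half_w, half_d)]
    c, s = math.cos(p.yaw), math.sin(p.yaw)
    xs, ys = [], []
    for x, y in corners:
        x, y = p.scale * x, p.scale * y
        xs.append(c * x - s * y + p.tx)
        ys.append(s * x + c * y + p.ty)
    return (min(xs), min(ys), max(xs), max(ys))


def overlaps(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def colliding_pairs(boxes):
    """Return index pairs (i, j), i < j, whose boxes overlap."""
    return [(i, j)
            for i in range(len(boxes))
            for j in range(i + 1, len(boxes))
            if overlaps(boxes[i], boxes[j])]
```

A scene would count as collision-free under this approximation when `colliding_pairs` returns an empty list. Boxes that only touch along an edge are not reported, because the comparisons are strict.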

Paper Structure

This paper contains 44 sections, 6 equations, 25 figures, 3 tables.

Figures (25)

  • Figure 1: We present TabletopGen, a training-free, fully automatic unified framework that generates instance-level interactive 3D tabletop scenes. As shown on the left, TabletopGen can generate visually realistic, detail-rich, plausibly arranged, and collision-free 3D scenes from either text or a single image input. As shown on the right, our framework can produce a wide variety of tabletop scenes, spanning different shapes, styles, and functional categories.
  • Figure 2: Overview of our TabletopGen framework. Our framework accepts either text (which is first converted into a reference image) or a single image. Starting from the image, we proceed in four stages: (1) Instance Extraction performs category analysis, segmentation, and completion to obtain clean, high-resolution per-instance images. (2) Canonical Model Generation uses Image-to-3D and MLLM-based alignment to create a 3D model with a canonical coordinate system for each instance. (3) Our core Pose and Scale Alignment stage recovers the spatial layout. The DRO (Differentiable Rotation Optimizer) estimates rotation by optimizing a tri-modal loss, while the TSA (Top-view Spatial Alignment) mechanism synthesizes a top-view image and, together with MLLM reasoning, selects an anchor instance via our RMA-Score to infer each instance’s translation and scale. (4) The 3D Scene Assembly stage combines all instance models with their poses and scales in a simulator to produce the final collision-free, interactive 3D tabletop scene.
  • Figure 3: Qualitative comparison under the same input images. Our method, TabletopGen, consistently outperforms all baselines. TabletopGen demonstrates strong adaptability across diverse tabletop types, delivering more realistic appearances, finer instance models, more coherent object counts and layouts, and collision-free placement.
  • Figure 4: Qualitative ablation study on pose and scale alignment components. Compared to our full model (Ours), removing DRO yields incorrect instance rotations (yellow circles), removing TSA causes misplacements (blue parallelograms), and removing both amplifies these errors, often leading to severe occlusions and collisions (red rectangles).
  • Figure 5: Qualitative comparison under the same input texts. Compared with the recent tabletop generation method MesaTask, TabletopGen generates more complete and realistic scenes, including stylistically consistent tables, more detailed instance models, and more semantically and physically reasonable layouts.
  • ...and 20 more figures