KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins

Quanyun Wu, Kyle Gao, Daniel Long, David A. Clausi, Jonathan Li, Yuhao Chen

Abstract

Embodied AI training and evaluation require object-centric digital twin environments with accurate metric geometry and semantic grounding. Recent transformer-based feedforward reconstruction methods can efficiently predict global point clouds from sparse monocular videos, yet these geometries suffer from inherent scale ambiguity and inconsistent coordinate conventions. This mismatch prevents the reliable fusion of such dimensionless point cloud predictions with locally reconstructed object meshes. We propose a novel scale-aware 3D fusion framework that registers visually grounded object meshes with transformer-predicted global point clouds to construct metrically consistent digital twins. Our method introduces a Vision-Language Model (VLM)-guided geometric anchor mechanism that resolves this fundamental coordinate mismatch by recovering an accurate real-world metric scale. To fuse the outputs of these networks, we propose a geometry-aware registration pipeline that explicitly enforces physical plausibility through gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement. Experiments on real indoor kitchen environments demonstrate improved cross-network object alignment and geometric consistency for downstream tasks, including multi-primitive fitting and metric measurement. We additionally introduce an open-source indoor digital twin dataset with metrically scaled scenes and semantically grounded, registered object-centric mesh annotations.
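At its core, the VLM-guided scale recovery described above reduces to computing one global scale factor from an anchor object whose real-world size is known. Below is a minimal Python sketch of that idea; the function name, the choice of anchor (a standard-height countertop), and the use of the vertical extent are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def recover_metric_scale(points, anchor_mask, anchor_height_m=0.9):
    """Estimate a global metric scale from a single VLM-identified anchor.

    points          : (N, 3) unscaled point cloud from the feedforward model
    anchor_mask     : (N,) boolean mask over the anchor object's points
    anchor_height_m : the anchor's known real-world height in meters
                      (0.9 m is a common kitchen countertop height; this
                      anchor choice is an illustrative assumption)
    """
    anchor = points[anchor_mask]
    # Vertical extent of the anchor in the model's arbitrary units,
    # assuming the cloud has already been gravity-aligned so that the
    # z-axis points up.
    predicted_height = anchor[:, 2].max() - anchor[:, 2].min()
    return anchor_height_m / predicted_height

# One multiplication lifts the entire prediction into metric space:
#   scale = recover_metric_scale(points, countertop_mask)
#   points_metric = points * scale
```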

Paper Structure

This paper contains 15 sections, 4 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Scenario comparison. Left: the output of our digital twin. Middle: a traditional fused reconstruction, in which objects (such as a bottle) share geometry with the environment and therefore cannot be manipulated independently. Right: our object-centric digital twin, in which objects are structurally independent, physically manipulable meshes.
  • Figure 2: Our proposed pipeline for semantically and geometrically grounded object-centric scene reconstruction. Stream A (top) reconstructs the global metric geometry via VLM-aided scale recovery. Stream B (bottom) generates high-fidelity object meshes from optimal 2D views. Stage C (right) fuses these streams through a geometrically grounded hierarchical registration process, enforcing physical plausibility and geometric constraints to produce a refined digital twin (a minimal sketch of the alignment step follows this list).
  • Figure 3: Visual progression of our 3D fusion framework compared to the baseline. From left to right: (1) the original input frame captured from the sequence; (2) the dense, unscaled point cloud reconstructed by Pi-Long; (3) our mesh, demonstrating that after scale recovery and geometry-aware registration, the lifted object meshes align tightly with the ground-truth 2D bounding box (red) when projected back to the camera view; (4) the naive SAM3D mesh baseline, which fails to establish a coherent metric space, resulting in severe misalignment with, or absence from, the ground-truth bounding box when re-rendered.
  • Figure 4: Qualitative comparison of the assembled 3D digital twin. Left: The baseline fails to establish a metric space, resulting in floating, unscaled, and overlapping artifacts. Right: With geometric grounding, our method produces a physically plausible, Manhattan-aligned, and tightly registered object-centric scene without subsurface penetration.
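To make the geometric grounding in Stage C more concrete, the sketch below shows one way to implement gravity-aligned vertical estimation followed by a Manhattan-world yaw snap. The up-vector input, the SVD-based wall-direction estimate, and all names here are assumptions for illustration; the paper's actual pipeline additionally performs hierarchical registration and collision-free local refinement.

```python
import numpy as np

def gravity_manhattan_align(points, up_estimate):
    """Rotate a cloud so the estimated gravity 'up' maps to +Z, then snap
    the dominant horizontal direction to the X axis (Manhattan constraint).

    up_estimate : unit vector, e.g. the normal of a detected floor plane
                  (the estimator itself is outside this sketch).
    """
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(up_estimate, z)
    c = float(np.dot(up_estimate, z))
    # Rodrigues' formula for the rotation taking up_estimate onto +Z
    # (degenerate if up_estimate is antiparallel to Z, i.e. c ~ -1).
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    R_up = np.eye(3) + vx + vx @ vx / (1.0 + c)
    pts = points @ R_up.T

    # Dominant horizontal direction via SVD of the centered XY coordinates;
    # rotating it onto the X axis enforces the Manhattan-world assumption.
    xy = pts[:, :2] - pts[:, :2].mean(axis=0)
    _, _, vt = np.linalg.svd(xy, full_matrices=False)
    theta = np.arctan2(vt[0, 1], vt[0, 0])
    cy, sy = np.cos(-theta), np.sin(-theta)
    R_yaw = np.array([[cy, -sy, 0.0],
                      [sy,  cy, 0.0],
                      [0.0, 0.0, 1.0]])
    return pts @ R_yaw.T
```

Applying this alignment before object registration gives every mesh a shared, gravity-consistent frame, which is what makes the tight, penetration-free placement in Figure 4 possible.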