Real2Code: Reconstruct Articulated Objects via Code Generation

Zhao Mandi; Yijia Weng; Dominik Bauer; Shuran Song

TL;DR

Real2Code reconstructs articulated objects from RGB observations by reframing joint prediction as executable code generation conditioned on compact oriented bounding box (OBB) abstractions. The method splits the problem into part-level geometry reconstruction (via kinematics-aware segmentation and shape completion) and articulation prediction with an LLM fine-tuned to output MuJoCo-executable Python code. Key components include OBB-based input, test-time prompting for view-consistent segmentation, and a LoRA-fine-tuned CodeLlama that scales to objects with up to 10 articulated parts. Experiments on PartNet-Mobility show state-of-the-art reconstruction and articulation accuracy, and qualitative real-world results show generalization from RGB images alone. The approach enables rapid generation of simulatable digital twins for robotics and VR/AR applications.

Abstract

We present Real2Code, a novel approach to reconstructing articulated objects via code generation. Given visual observations of an object, we first reconstruct its part geometry using an image segmentation model and a shape completion model. We then represent the object parts with oriented bounding boxes, which are input to a fine-tuned large language model (LLM) to predict joint articulation as code. By leveraging pre-trained vision and language models, our approach scales with the number of articulated parts and generalizes from synthetic training data to real-world objects in unstructured environments. Experimental results demonstrate that Real2Code significantly outperforms the previous state of the art in reconstruction accuracy and is the first approach to extrapolate beyond the structural complexity of objects in the training set, reconstructing objects with up to 10 articulated parts. When combined with a stereo reconstruction model, Real2Code also generalizes to real-world objects from a handful of multi-view RGB images, without the need for depth or camera information.
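To make the oriented bounding box abstraction concrete, the following is a minimal sketch of fitting an OBB to a part point cloud with PCA. It is an illustration under stated assumptions, not the procedure used in the paper: the function name `fit_obb`, the PCA-based fitting, and the (center, axes, half-extents) output layout are all choices made here for the example.

```python
import numpy as np


def fit_obb(points: np.ndarray):
    """Fit an oriented bounding box to an (N, 3) point cloud using PCA.

    Hypothetical helper for illustration only. Returns (center, axes,
    half_extents), where the rows of `axes` are orthonormal box axes
    ordered from largest to smallest point variance.
    """
    mean = points.mean(axis=0)
    # Rows of vt are the principal directions of the centered points.
    _, _, vt = np.linalg.svd(points - mean, full_matrices=False)
    axes = vt
    # Coordinates of every point expressed in the box frame.
    local = (points - mean) @ axes.T
    lo, hi = local.min(axis=0), local.max(axis=0)
    half_extents = (hi - lo) / 2.0
    # The box center is the midpoint of the extents, mapped back to the world frame.
    center = mean + ((lo + hi) / 2.0) @ axes
    return center, axes, half_extents
```

A part is then summarized by nine numbers for orientation, three for the center, and three for the half-extents, which is far more compact as LLM input than a raw mesh or point cloud.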

Paper Structure (24 sections, 10 figures, 3 tables)

Introduction
Related Work
Method
Part Reconstruction
Kinematics-aware Part Segmentation
Test-time Prompting for View-consistent Segmentation
Part-level Shape Completion
Articulation Prediction via Code Generation
Oriented Bounding Box as Input Abstraction
Fine-tuning a Code Generation LLM
Experiments
Experiment Setup
Part Segmentation and Reconstruction Experiments
Articulation Prediction Experiments
Qualitative Results
...and 9 more sections

Figures (10)

Figure 1: We propose a novel method for reconstructing articulated objects via code generation, leveraging pre-trained large language models (LLMs). Real2Code takes visual observations as input, and performs both part-level geometry reconstruction and joint prediction. When evaluated on an extensive set of real and synthetic objects with varying levels of kinematic complexity, Real2Code can reconstruct complex articulated objects with up to 10 parts, and generalize to real-world objects from a handful of pose-free RGB images.
Figure 2: Overview of our proposed pipeline. Given unstructured multi-view RGB images, we leverage the pre-trained DUSt3R model [wang2023dust3r] to obtain dense 2D-to-3D pointmaps, and use a fine-tuned 2D segmentation model [sam] to perform part-level segmentation and project to segmented 3D point clouds. A learned shape-completion model takes partial point cloud inputs and predicts a dense occupancy field, which is used for part-level mesh extraction. We fine-tune a large language model (LLM) [codellama] that takes mesh information in the form of oriented bounding boxes, and outputs full code descriptions of the object that can directly be executed in simulation.
Figure 3: View-consistent segmentation. Illustration of our method for test-time prompting the fine-tuned SAM model. We first sample 3D points from the foreground object point cloud, then project each point onto the 2D RGB images captured from different camera views; the projected points are used to prompt the model to generate view-consistent segmentations.
Figure 4: Articulation Prediction as Code. We fine-tune a CodeLlama model that takes in oriented bounding boxes (OBBs) for segmented parts as input, and generates joint predictions by selecting OBB rotation axes and edges (model generation is highlighted in green). A helper function is used to compute the absolute joint axis and position that assembles the object parts in simulation.
Figure 5: Qualitative results that compare Real2Code to baseline methods. We show results from objects with a range of varying kinematic complexities, from a two-part laptop to a ten-part multi-drawer table. Whereas all methods can handle the simpler laptop articulation, baseline methods struggle as the number of object parts increases, and Real2Code performs reconstruction much more accurately. PARIS runs out of memory and fails to run on the ten-part table ('N/A').
...and 5 more figures
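The Figure 4 caption describes a helper that turns a selected OBB axis and edge into an absolute joint axis and position. The sketch below shows one way such a helper could work; it is an assumption-laden illustration, not the authors' implementation. The name `joint_from_obb`, the `axis_index` argument, and the `edge_signs` encoding (a sign for each of the two remaining box axes, picking one of the four edges parallel to the joint axis) are hypothetical.

```python
import numpy as np


def joint_from_obb(center, axes, half_extents, axis_index, edge_signs=None):
    """Convert a discrete (axis, edge) choice on an OBB into a world-frame joint.

    Hypothetical helper for illustration only.
    center:       (3,) box center in world coordinates
    axes:         (3, 3) matrix whose rows are the box axes
    half_extents: (3,) half side lengths along each box axis
    axis_index:   which box axis (0, 1, or 2) the joint axis follows
    edge_signs:   optional pair of +1/-1 values selecting the box edge that is
                  parallel to the joint axis; None keeps the box center
    """
    center = np.asarray(center, dtype=float)
    axes = np.asarray(axes, dtype=float)
    half_extents = np.asarray(half_extents, dtype=float)

    # Joint direction: the chosen box axis, normalized to unit length.
    joint_axis = axes[axis_index] / np.linalg.norm(axes[axis_index])

    # Joint position: start at the center, then move to the chosen edge by
    # offsetting along the two box axes that are not the joint axis.
    joint_pos = center.copy()
    if edge_signs is not None:
        others = [i for i in range(3) if i != axis_index]
        for i, sign in zip(others, edge_signs):
            joint_pos += sign * half_extents[i] * axes[i]
    return joint_axis, joint_pos
```

Framing the prediction as a choice among a few box axes and edges means the language model selects from a small discrete set rather than regressing continuous 3D coordinates, which is the motivation the caption suggests for the OBB abstraction.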
