Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model

Long Le, Jason Xie, William Liang, Hung-Ju Wang, Yue Yang, Yecheng Jason Ma, Kyle Vedder, Arjun Krishna, Dinesh Jayaraman, Eric Eaton

TL;DR

Articulate-Anything automates the articulation of diverse 3D objects from multimodal inputs by framing articulation as a program-synthesis problem: a vision-language actor-critic system writes Python code that is compiled to URDFs. The pipeline has three stages (mesh retrieval, link placement, and joint prediction) and uses grounded feedback from a critic to iteratively refine solutions. It achieves state-of-the-art performance on PartNet-Mobility (approximately 75% success) and demonstrates real-world utility by generating assets from in-the-wild videos, training robotic policies on those assets in simulation, and transferring the policies to a real robot. This approach enables scalable creation of rich, interactive digital twins for AR/VR and robotics applications, reducing manual labor and enabling broader simulation-to-real-world transfer.
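To make "Python code compiled to URDFs" concrete, here is a minimal sketch of what such a compilation step could look like for a single hinged part. It is an illustration and not the authors' code: the function `build_urdf`, its parameters, and the cabinet example values are all assumptions. Only the URDF element names (`robot`, `link`, `joint`, `parent`, `child`, `origin`, `axis`, `limit`) come from the URDF format itself.

```python
import xml.etree.ElementTree as ET


def build_urdf(name, parent, child, joint_type, axis, origin, lower, upper):
    """Compile a one-joint description into a URDF string (hypothetical helper)."""
    robot = ET.Element("robot", name=name)
    ET.SubElement(robot, "link", name=parent)
    ET.SubElement(robot, "link", name=child)

    joint = ET.SubElement(
        robot, "joint", name=f"{parent}_to_{child}", type=joint_type
    )
    ET.SubElement(joint, "parent", link=parent)
    ET.SubElement(joint, "child", link=child)
    # Where the joint frame sits relative to the parent link.
    ET.SubElement(joint, "origin", xyz=" ".join(map(str, origin)), rpy="0 0 0")
    # Axis of rotation (revolute) or translation (prismatic) in the joint frame.
    ET.SubElement(joint, "axis", xyz=" ".join(map(str, axis)))
    # Motion limits; the effort and velocity values here are arbitrary placeholders.
    ET.SubElement(
        joint, "limit",
        lower=str(lower), upper=str(upper), effort="10", velocity="1",
    )
    return ET.tostring(robot, encoding="unicode")


# Assumed example: a cabinet door hinged about the vertical axis.
urdf_text = build_urdf(
    name="cabinet", parent="body", child="door", joint_type="revolute",
    axis=(0, 0, 1), origin=(0.3, 0.2, 0.0), lower=0.0, upper=1.57,
)
print(urdf_text)
```

A usable asset would also attach visual and collision meshes to each link. They are omitted here to keep the joint definition, which is the part the articulation stages predict, in focus.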

Abstract

Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation. However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we present Articulate-Anything, a system that automates the articulation of diverse, complex objects from many input modalities, including text, images, and videos. Articulate-Anything leverages vision-language models (VLMs) to generate code that can be compiled into an interactable digital twin for use in standard 3D simulators. Our system exploits existing 3D asset datasets via a mesh retrieval mechanism, along with an actor-critic system that iteratively proposes, evaluates, and refines solutions for articulating the objects, self-correcting errors to achieve a robust outcome. Qualitative evaluations demonstrate Articulate-Anything's capability to articulate complex and even ambiguous object affordances by leveraging rich grounded inputs. In extensive quantitative experiments on the standard PartNet-Mobility dataset, Articulate-Anything substantially outperforms prior work, increasing the success rate from 8.7-11.6% to 75% and setting a new bar for state-of-the-art performance. We further showcase the utility of our system by generating 3D assets from in-the-wild video inputs, which are then used to train robotic policies for fine-grained manipulation tasks in simulation that go beyond basic pick and place. These policies are then transferred to a real robotic system.
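The propose, evaluate, refine cycle described in the abstract can be sketched as a short loop. This is a toy under stated assumptions: in the paper both actor and critic are VLMs, whereas `actor`, `critic`, and `refine` below are hypothetical deterministic stand-ins, and the candidate axes, score, and threshold are invented for illustration.

```python
def actor(feedback):
    """Hypothetical actor: proposes a hinge axis, avoiding rejected ones."""
    candidates = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    tried = feedback["tried"] if feedback else []
    # Propose the first candidate not yet rejected; fall back to the last one.
    return next((c for c in candidates if c not in tried), candidates[-1])


def critic(proposal, observed_axis):
    """Hypothetical critic: scores a proposal against the observed motion."""
    # Absolute dot product of unit axes: 1 means aligned, 0 means orthogonal.
    return abs(sum(p * o for p, o in zip(proposal, observed_axis)))


def refine(observed_axis, threshold=0.9, max_iters=5):
    """Iterate actor and critic until the score clears the threshold."""
    feedback = None
    history = []
    for _ in range(max_iters):
        proposal = actor(feedback)
        score = critic(proposal, observed_axis)
        history.append((proposal, score))
        if score >= threshold:
            break
        # Feed the rejected proposals back to the actor for the next attempt.
        feedback = {"tried": [p for p, _ in history]}
    return proposal, score, history


best, best_score, history = refine(observed_axis=(0, 0, 1))
print(best, best_score, len(history))  # (0, 0, 1) 1 3
```

The iteration cap matters in practice: a critic can keep rejecting proposals indefinitely, so the loop returns the last attempt once `max_iters` is reached instead of running forever.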

Paper Structure

This paper contains 29 sections, 5 equations, 25 figures, 3 tables.

Figures (25)

  • Figure 1: Given text, images, or videos showing an object's motion, Articulate-Anything automatically generates its 3D interactable digital twin, handling a wide variety of objects and affordances. Among other applications, these articulated assets can be used to train robotic manipulation policies in Sim2Real. Full video demonstrations and source code are available on the website.
  • Figure 2: Method Overview. Given a text, image, or video input, Articulate-Anything operates in three stages: (1) Mesh Retrieval (Sec. \ref{sub:mesh_retrieval}) retrieves a mesh for each object part from a 3D asset library, (2) Link Placement (Sec. \ref{sub:link_placement}) places the parts together, and (3) Joint Prediction (Sec. \ref{sub:joint_prediction}) predicts the allowed kinematic movements between parts. Optionally, instead of generating all possible kinematic joints, we can target a specific joint from the input video (Sec. \ref{sub:targeted_affordance}). The link placement and joint prediction systems consist of an actor and a critic, which are VLMs working together. The actor proposes solutions, and the critic examines those solutions and gives feedback.
  • Figure 3: Mesh retrieval. The top and bottom diagrams provide overviews for reconstructing visual (i.e., image or video) and text inputs, respectively. For visual input, we match the ground-truth object to a template object in the library using an efficient divide-and-conquer retrieval mechanism. For text input, we first prompt an LLM to predict the different object parts and their dimensions. Then, we retrieve a mesh for each part using precomputed CLIP embeddings and subsequently scale the meshes to specifications. More details in Sec. \ref{sub:mesh_retrieval}.
  • Figure 4: Both link placement and joint prediction systems consist of an actor and a critic. The actor produces Python code, which is automatically compiled into URDFs and rendered in simulation. The source code, together with the predicted and input modalities (images for link placement and videos for joint prediction), is given to the critic for evaluation. The synergy between the actor and critic enables self-correction of errors (red border) and successful articulation (green border).
  • Figure 5: Comparison against the baselines. Our approach significantly outperforms all baselines in the joint prediction task. We use few-shot prompting and make no distinction between ID and OOD classes, so we only report results for all classes. 95% confidence intervals are included.
  • ...and 20 more figures
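
The text-input path in the Figure 3 caption (retrieve a mesh per part from precomputed embeddings, then scale it to the predicted dimensions) can be illustrated with a small nearest-neighbor lookup. This is a sketch under assumptions, not the paper's implementation: the three-dimensional vectors stand in for real CLIP embeddings, which are far higher-dimensional, and the library entries and the `retrieve` and `scale_factors` helpers are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy stand-ins for precomputed mesh embeddings (assumed values).
library = {
    "door_mesh": [0.9, 0.1, 0.0],
    "drawer_mesh": [0.1, 0.9, 0.1],
    "handle_mesh": [0.0, 0.2, 0.9],
}


def retrieve(query, library):
    """Return the name of the library mesh most similar to the query."""
    return max(library, key=lambda name: cosine(query, library[name]))


def scale_factors(mesh_size, target_size):
    """Per-axis factors that resize a mesh bounding box to the target size."""
    return tuple(t / m for m, t in zip(mesh_size, target_size))


query = [0.2, 0.8, 0.0]  # assumed embedding of a part description
best_mesh = retrieve(query, library)
factors = scale_factors(mesh_size=(1.0, 2.0, 0.5), target_size=(0.5, 1.0, 0.25))
print(best_mesh, factors)  # drawer_mesh (0.5, 0.5, 0.5)
```

Because the library embeddings are computed once ahead of time, each query costs only one embedding of the part description plus a similarity scan over the library.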