PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu

TL;DR

PhysX-Anything tackles the gap between visually plausible 3D generation and physically grounded, simulation-ready assets by introducing a unified VLM-based pipeline that jointly predicts geometry, articulation, and physical attributes from a single image. It addresses the token-efficiency challenge with a coarse-to-fine voxel representation that compresses geometry tokens by $193\times$ without an extra tokenizer, and it decodes the results into URDF/XML files and meshes via a controllable flow transformer and diffusion-based refinement. To support broad real-world deployment, the authors assemble PhysX-Mobility, a physically annotated 3D dataset spanning 47 categories and over 2K objects, which significantly expands prior coverage. Extensive experiments show strong generalization to in-the-wild images, ablations validate the representation choices, and simulation-based robotics tasks in MuJoCo demonstrate direct applicability to policy learning and manipulation in contact-rich scenarios. Collectively, PhysX-Anything enables sim-ready physical 3D generation that can be deployed end to end in standard simulators, with tangible benefits for embodied AI, robotics, and physics-based simulation.

Abstract

3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.
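
To make "explicit geometry, articulation, and physical attributes" concrete, the following is a minimal sketch of what a sim-ready URDF description contains: links with a mass and a mesh, connected by a joint with an axis and limits. It is not the authors' exporter. The helper names (`make_link`, `make_revolute_joint`, `build_cabinet_urdf`), the toy cabinet, the mesh file names, and all numeric values are assumptions chosen for illustration only.

```python
import xml.etree.ElementTree as ET


def make_link(name: str, mass_kg: float, mesh_file: str) -> ET.Element:
    """Build a URDF <link> carrying a physical attribute (mass) and a mesh."""
    link = ET.Element("link", name=name)
    inertial = ET.SubElement(link, "inertial")
    ET.SubElement(inertial, "mass", value=f"{mass_kg:g}")
    # Placeholder inertia tensor (assumption); a real asset would derive it
    # from the part geometry and density.
    ET.SubElement(inertial, "inertia", ixx="0.01", ixy="0", ixz="0",
                  iyy="0.01", iyz="0", izz="0.01")
    visual = ET.SubElement(link, "visual")
    geometry = ET.SubElement(visual, "geometry")
    ET.SubElement(geometry, "mesh", filename=mesh_file)
    return link


def make_revolute_joint(name, parent, child, axis, lower, upper) -> ET.Element:
    """Build a URDF revolute <joint>: the articulation between two links."""
    joint = ET.Element("joint", name=name, type="revolute")
    ET.SubElement(joint, "parent", link=parent)
    ET.SubElement(joint, "child", link=child)
    ET.SubElement(joint, "origin", xyz="0 0 0", rpy="0 0 0")
    ET.SubElement(joint, "axis", xyz=" ".join(f"{a:g}" for a in axis))
    # Joint range in radians; effort and velocity limits are placeholders.
    ET.SubElement(joint, "limit", lower=f"{lower:g}", upper=f"{upper:g}",
                  effort="10", velocity="1")
    return joint


def build_cabinet_urdf() -> str:
    """Assemble a hypothetical two-part cabinet with one hinged door."""
    robot = ET.Element("robot", name="toy_cabinet")
    robot.append(make_link("body", 5.0, "body.obj"))
    robot.append(make_link("door", 1.0, "door.obj"))
    robot.append(make_revolute_joint("door_hinge", "body", "door",
                                     (0, 0, 1), 0.0, 1.57))
    return ET.tostring(robot, encoding="unicode")


urdf_text = build_cabinet_urdf()
print(urdf_text)
```

A file with this structure can be loaded by simulators that accept URDF; the per-part meshes referenced by `filename` would have to exist alongside it.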

Paper Structure

This paper contains 13 sections, 1 equation, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Given a single real-world image, PhysX-Anything generates a detailed physical 3D object, recovering both its articulation structure and physical properties, and exports URDF and XML files that can be directly deployed in physics engines.
  • Figure 2: Overview of PhysX-Anything. PhysX-Anything conducts a multi-round conversation to produce a physical representation that includes overall information (left) and detailed geometric information for each part (right). Decoding this representation yields high-quality, simulation-ready 3D assets with explicit physical attributes that can be directly used in downstream applications.
  • Figure 3: Comparison of token counts between representations. By adopting a voxel-based representation together with a specialized merging strategy, our method reduces the token count by 193× compared with the original mesh format.
  • Figure 4: Detailed structure of the physical representation decoder. Given the coarse geometry, a controllable flow transformer is employed to generate fine-grained geometric information. The format decoder then combines the overall physical information and the refined geometry to produce assets in six different formats.
  • Figure 5: Qualitative results on the test set of PhysX-Mobility. Compared with other methods, PhysX-Anything generates high-quality, sim-ready physical 3D assets with more faithful geometry, articulation, and physical attributes.
  • ...and 3 more figures
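
The Figure 3 caption attributes the token reduction to a voxel-based representation plus a merging strategy, but this page does not spell out the strategy. As a rough illustration of why merging helps, the sketch below assumes one plausible scheme: list the linear indices of occupied voxels and collapse consecutive indices into `start-end` ranges. The function names, the 8×8×8 grid, and the range syntax are all assumptions, and the entries counted here are text items, not tokens from any actual VLM tokenizer.

```python
import numpy as np


def occupied_indices(grid: np.ndarray) -> list[int]:
    """Return sorted linear (row-major) indices of occupied voxels."""
    return np.flatnonzero(grid).tolist()


def merge_runs(indices: list[int]) -> list[str]:
    """Collapse consecutive indices into 'start-end'; isolated ones stay as-is."""
    entries = []
    i = 0
    while i < len(indices):
        j = i
        # Extend the run while the next index is exactly one larger.
        while j + 1 < len(indices) and indices[j + 1] == indices[j] + 1:
            j += 1
        if i == j:
            entries.append(str(indices[i]))
        else:
            entries.append(f"{indices[i]}-{indices[j]}")
        i = j + 1
    return entries


def expand_runs(entries: list[str]) -> list[int]:
    """Invert merge_runs, showing that the merged form is lossless."""
    out = []
    for entry in entries:
        if "-" in entry:
            start, end = map(int, entry.split("-"))
            out.extend(range(start, end + 1))
        else:
            out.append(int(entry))
    return out


# Toy coarse grid (assumed size): a solid 4x4x4 block inside an 8x8x8 volume.
grid = np.zeros((8, 8, 8), dtype=bool)
grid[2:6, 2:6, 2:6] = True

idx = occupied_indices(grid)
entries = merge_runs(idx)
print(len(idx), "occupied voxels ->", len(entries), "merged entries")
```

On this toy block, each run of four voxels along the fastest-varying axis becomes a single entry, so the 64 occupied voxels are described by 16 entries. The saving depends on how contiguous the occupied voxels are and says nothing about the 193× figure reported in the paper.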