
Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels

Long Le, Ryan Lucas, Chen Wang, Chuhao Chen, Dinesh Jayaraman, Eric Eaton, Lingjie Liu

TL;DR

Pixie addresses the challenge of inferring 3D material properties from visual input by learning a generalizable, feed-forward mapping from CLIP-based 3D visual features to a voxelized material field that specifies both a discrete material type and continuous parameters ($E$, $\nu$, $\rho$). The approach uses NeRF-based feature distillation to create a dense $N\times N\times N\times D$ grid, which a 3D U-Net converts into a per-voxel material grid $\hat{\mathcal{M}}_G$, supervised on the richly labeled PixieVerse dataset. By coupling the predicted fields with Gaussian splatting and an MPM physics solver, Pixie achieves fast, realistic 3D simulations and demonstrates substantial improvements over test-time optimization baselines, including zero-shot transfer to real scenes via CLIP priors. The work introduces a large, semi-automatically labeled dataset and highlights the power of visual priors for bridging sim-to-real gaps in physically grounded scene understanding.
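
To make the output format concrete, the per-voxel material field described above (one discrete material type plus continuous $E$, $\nu$, $\rho$) might be represented as follows. This is only a minimal sketch: the names `VoxelMaterial`, `decode_voxel`, and `decode_grid` are hypothetical, the use of an argmax over class scores is an assumption about how a discrete type could be read out, and the numeric values are made up for illustration. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

# The dataset overview reports 5 constitutive material types.
# The integer IDs 0..4 used here are placeholders, not the paper's labels.
NUM_MATERIAL_TYPES = 5

VoxelIndex = Tuple[int, int, int]


@dataclass(frozen=True)
class VoxelMaterial:
    """Hypothetical record for one voxel of the predicted material grid."""
    material_id: int        # discrete material type
    youngs_modulus: float   # E
    poissons_ratio: float   # nu
    density: float          # rho


def decode_voxel(logits: Sequence[float], youngs_modulus: float,
                 poissons_ratio: float, density: float) -> VoxelMaterial:
    """Pick the highest-scoring material type and attach the continuous values.

    Assumption: the network emits one score per material type; the source
    does not state how the discrete type is selected.
    """
    if len(logits) != NUM_MATERIAL_TYPES:
        raise ValueError(f"expected {NUM_MATERIAL_TYPES} scores, got {len(logits)}")
    material_id = max(range(NUM_MATERIAL_TYPES), key=lambda i: logits[i])
    return VoxelMaterial(material_id, float(youngs_modulus),
                         float(poissons_ratio), float(density))


def decode_grid(raw: Dict[VoxelIndex, tuple]) -> Dict[VoxelIndex, VoxelMaterial]:
    """Decode a sparse grid given as {(i, j, k): (logits, E, nu, rho)}."""
    return {idx: decode_voxel(*values) for idx, values in raw.items()}


# Illustrative values only.
raw = {(0, 0, 0): ([0.1, 2.0, -1.0, 0.0, 0.5], 1.0e6, 0.3, 1000.0)}
field = decode_grid(raw)
print(field[(0, 0, 0)].material_id)  # prints 1
```

A dictionary keyed by voxel index keeps the sketch dependency-free; a dense array of shape $N\times N\times N$ per quantity would be the more natural choice in practice.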

Abstract

Inferring the physical properties of 3D scenes from visual information is a critical yet challenging task for creating interactive and realistic virtual worlds. While humans intuitively grasp material characteristics such as elasticity or stiffness, existing methods often rely on slow, per-scene optimization, limiting their generalizability and application. To address this problem, we introduce PIXIE, a novel method that trains a generalizable neural network to predict physical properties across multiple scenes from 3D visual features purely using supervised losses. Once trained, our feed-forward network can perform fast inference of plausible material fields, which, coupled with a learned static scene representation like Gaussian Splatting, enables realistic physics simulation under external forces. To facilitate this research, we also collected PIXIEVERSE, one of the largest known datasets of paired 3D assets and physics material annotations. Extensive evaluations demonstrate that PIXIE is about 1.46-4.39x better and orders of magnitude faster than test-time optimization methods. By leveraging pretrained visual features like CLIP, our method can also zero-shot generalize to real-world scenes despite only ever being trained on synthetic data. https://pixie-3d.github.io/

Paper Structure

This paper contains 27 sections, 10 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: We introduce Pixie, a novel method for learning simulatable physics of 3D scenes from visual features. Trained on a curated dataset of paired 3D objects and physical material annotations, Pixie can predict both the discrete material types (e.g., rubber) and continuous values including Young's modulus, Poisson's ratio, and density for a variety of materials, including elastic, plastic, and granular. The predicted material parameters can then be coupled with a learned static 3D model such as Gaussian splats and a physics solver such as the Material Point Method (MPM) to produce realistic 3D simulation under physical forces such as gravity and wind.
  • Figure 2: Method Overview. From posed multi-view RGB images of a static scene, Pixie first reconstructs a 3D model with NeRF and distilled CLIP features [Shen et al., 2023]. Then, we voxelize the features into a regular $N \times N \times N \times D$ grid where $N$ is the grid size and $D$ is the CLIP feature dimension. A U-Net neural network [Dhariwal and Nichol, 2021] is trained to map the feature grid to the material field $\hat{\mathcal{M}}_G$, which consists of a discrete material model ID and continuous Young's modulus, Poisson's ratio, and density values for each voxel. Coupled with a separately trained Gaussian splatting model, $\hat{\mathcal{M}}_G$ can be used to simulate physics with a physics solver such as MPM.
  • Figure 3: PixieVerse Dataset Overview. We collect 1624 high-quality single-object assets, spanning 10 semantic classes (a) and 5 constitutive material types (b). The dataset is annotated with detailed physical properties including spatially varying discrete material types (b), Young's modulus (c), Poisson's ratio (d), and mass density (e). The left figure shows representative examples from the dataset: organic matter (tree, shrubs, grass, flowers), deformable toys (rubber ducks), sports equipment (sport balls), granular media (sand, snow & mud), and hollow containers (soda cans, metal crates).
  • Figure 4: Main VLM Results. (a) VLM score versus wall-clock time: Pixie is three orders of magnitude faster than previous works while achieving 1.46-4.39x improvement in realism. Test-time optimization methods are run with varying numbers of epochs, i.e., $1, 25, 50$ for DreamPhysics and $1, 2, 5$ for OmniPhysGS, while inference methods are only run once. (b) Per-class VLM score: Our method leads on most object classes. Standard errors are also included.
  • Figure 5: Qualitative comparison on synthetic scenes. We visualize the predicted material class and $E$ predictions (left and right, respectively) for Pixie and NeRF2Physics, $E$ for DreamPhysics (right), and the plasticity and hyperelastic function classes predicted by OmniPhysGS. Pixie produces stable, physically plausible motion, while DreamPhysics remains overly stiff due to inaccurate fine-grained $E$ prediction or too high $E$ (e.g., see tree (C)), OmniPhysGS collapses under load due to an unrealistic combination of plasticity and hyperelastic functions, and NeRF2Physics exhibits noisy artifacts. Please see https://pixie-3d.github.io/ for the videos.
  • ...and 13 more figures
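
The Figure 2 caption describes a material field defined on a regular $N \times N \times N$ grid that is then coupled with a separately trained Gaussian splatting model. The captions do not say how values are transferred from voxels to Gaussians, so the following is only one plausible sketch: a nearest-voxel lookup that maps a 3D point (for example, a Gaussian center) to the index of the voxel containing it. The function name `point_to_voxel`, the axis-aligned bounding box, and the clamping of out-of-bounds points are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def point_to_voxel(point: Sequence[float],
                   bbox_min: Sequence[float],
                   bbox_max: Sequence[float],
                   n: int) -> Tuple[int, int, int]:
    """Return the (i, j, k) index of the voxel containing `point`.

    Assumptions (not from the source): the grid is axis-aligned, spans
    [bbox_min, bbox_max] with n cells per axis, and points outside the
    box are clamped to the nearest boundary voxel.
    """
    if n <= 0:
        raise ValueError("grid size n must be positive")
    index = []
    for p, lo, hi in zip(point, bbox_min, bbox_max):
        if hi <= lo:
            raise ValueError("bbox_max must exceed bbox_min on every axis")
        t = (p - lo) / (hi - lo)          # normalized coordinate in [0, 1]
        i = math.floor(t * n)             # cell index before clamping
        index.append(min(max(i, 0), n - 1))
    return tuple(index)


# A point at the center of a [-1, 1]^3 box on a 4^3 grid.
print(point_to_voxel((0.0, 0.0, 0.0), (-1, -1, -1), (1, 1, 1), 4))  # prints (2, 2, 2)
```

Each Gaussian could then read the material type, $E$, $\nu$, and $\rho$ stored at the returned index before being handed to the physics solver; smoother alternatives such as trilinear interpolation of the continuous values are equally conceivable.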