Physical Property Understanding from Language-Embedded Feature Fields

Albert J. Zhai, Yuan Shen, Emily Y. Chen, Gloria X. Wang, Xinlei Wang, Sheng Wang, Kaiyu Guan, Shenlong Wang

TL;DR

The paper introduces NeRF2Physics, a training-free framework that predicts dense physical properties from image collections by constructing a language-embedded 3D feature field. It combines NeRF-derived geometry with CLIP-based per-point features and uses LLMs to generate a dictionary of candidate materials and their properties, enabling zero-shot regression of properties such as mass density, friction, and hardness. For volumetric properties such as total mass, an object-level aggregation step uses LLMs to estimate surface thickness. The method is validated on ABO-500 for mass and on real-world datasets for friction and hardness, where it surpasses several baselines without requiring annotations. The work advances open-world physical-property understanding, with practical implications for digital twins, robotics, and agriculture, and highlights the potential of integrating vision-language models with geometric representations for physics-aware perception.

Abstract

Can computers perceive the physical properties of objects solely through vision? Research in cognitive science and vision science has shown that humans excel at identifying materials and estimating their physical properties based purely on visual appearance. In this paper, we present a novel approach for dense prediction of the physical properties of objects using a collection of images. Inspired by how humans reason about physics through vision, we leverage large language models to propose candidate materials for each object. We then construct a language-embedded point cloud and estimate the physical properties of each 3D point using a zero-shot kernel regression approach. Our method is accurate, annotation-free, and applicable to any object in the open world. Experiments demonstrate the effectiveness of the proposed approach in various physical property reasoning tasks, such as estimating the mass of common objects, as well as other properties like friction and hardness.
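The zero-shot kernel regression mentioned in the abstract can be illustrated with a minimal sketch. It assumes a softmax-weighted average over cosine similarities between a point's feature vector and the embeddings of the candidate materials; the function names, the `temperature` parameter, and the toy 2D vectors are hypothetical stand-ins for CLIP embeddings and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def kernel_regress(point_feature, material_features, material_values, temperature=0.1):
    """Softmax-weighted average of the material property values.

    Assumption: the kernel is exp(cosine_similarity / temperature);
    the kernel and temperature used in the paper may differ.
    """
    sims = [cosine(point_feature, m) / temperature for m in material_features]
    peak = max(sims)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in sims]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, material_values)) / total

# Toy dictionary: two candidate materials with densities in kg/m^3 (illustrative values).
materials = [[1.0, 0.0], [0.0, 1.0]]
densities = [100.0, 900.0]
print(kernel_regress([1.0, 1.0], materials, densities))
```

A point whose feature is equally similar to both materials receives the average of their values, while a lower temperature pushes the estimate toward the single best-matching material.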

Paper Structure (38 sections, 4 equations, 18 figures, 9 tables)

Figures (18)

  • Figure 1: Estimating physical properties from images. Humans can predict physical properties of objects by associating visual appearances with grounded knowledge about materials. We propose to equip computers with this capability by combining language-embedded feature fields with LLM-based material reasoning.
  • Figure 2: Overview of NeRF2Physics. Given a collection of posed images, we first train a neural radiance field to capture the 3D geometry of the scene. Then, we fuse vision-language features into a point cloud extracted from the field. Next, we use a captioning model to provide a text description of the scene and prompt an LLM to produce a dictionary of possible materials in the scene, along with their physical properties. From here, physical properties can be estimated at any query point using zero-shot CLIP-based kernel regression within the dictionary. The kernel regression process is illustrated in more detail in Fig. 3.
  • Figure 3: Overview of zero-shot physical property prediction. To predict physical property values from the language-embedded point cloud, we extract CLIP features and perform kernel regression using the predicted dictionary of materials and their properties. To predict the total mass of an object, we then integrate the predicted mass density across cuboids on the surface of the object. The thickness of each cuboid is estimated in the same way as the other physical properties.
  • Figure 4: Example visualizations. We visualize input images from ABO-500 along with our model's CLIP feature PCA components, zero-shot material segmentation, and predicted mass density. Our model makes reasonable predictions of materials across different parts of objects in 3D, allowing for grounded predictions of physical properties.
  • Figure 5: Example predictions of different physical properties. We visualize predictions of hardness and friction on objects from our own collected dataset. For evaluation purposes, Shore A and Shore D hardness were combined into the same scale. The friction coefficient represents the coefficient of kinetic friction against a fabric surface. We quantitatively evaluate these predictions using a set of sparse per-point measurements (see the hardness and friction experiments section).
  • ...and 13 more figures
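The mass integration described in the Figure 3 caption can be sketched as a sum over surface cuboids, each contributing density × thickness × surface area. This is a minimal illustration, assuming every surface sample covers an equal patch of area; `integrate_mass`, `patch_area`, and the numeric values are hypothetical and not drawn from the paper.

```python
def integrate_mass(densities, thicknesses, patch_area):
    """Sum the mass of surface cuboids.

    densities:   per-point mass density in kg/m^3
    thicknesses: per-point cuboid thickness in m
    patch_area:  surface area covered by each point in m^2
                 (assumption: equal-area patches for all points)
    """
    if len(densities) != len(thicknesses):
        raise ValueError("densities and thicknesses must have the same length")
    return sum(rho * t * patch_area for rho, t in zip(densities, thicknesses))

# Two surface samples with illustrative values.
mass_kg = integrate_mass([1000.0, 500.0], [0.01, 0.02], patch_area=0.5)
print(mass_kg)
```

In this sketch both the densities and the thicknesses would come from the same per-point regression, which matches the caption's note that thickness is estimated in the same way as the other properties.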