Identifying Terrain Physical Parameters from Vision -- Towards Physical-Parameter-Aware Locomotion and Navigation

Jiaqi Chen, Jonas Frey, Ruyi Zhou, Takahiro Miki, Georg Martius, Marco Hutter

TL;DR

This work tackles the challenge of inferring terrain physical properties from vision to enable physical-parameter-aware locomotion and navigation. It introduces a two-stage, self-supervised framework in which a simulation-trained physical decoder predicts per-foot friction and stiffness and uses these predictions to label real images; the labeled images then train a dense visual predictor that outputs per-pixel terrain properties. Anomaly detection via a two-component Gaussian Mixture Model provides reliability masks, while online training through a Mission Graph enables continual adaptation to new environments. Across simulation, real-world tests, and digital-twin experiments, the approach improves per-foot parameter estimation over an existing baseline and produces robust dense visual predictions, supporting effective sim-to-real transfer of physical-parameter-aware policies.
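
As a rough illustration of how a two-component Gaussian Mixture Model can turn per-pixel reconstruction losses into a reliability mask, the sketch below fits a 1D mixture with plain EM and marks a pixel as reliable when the low-loss component is the more probable one. The function names, the synthetic loss values, the quartile initialization, and the 0.5 posterior threshold are all assumptions made for this example; the paper's actual thresholding rule is not given in this summary.

```python
import math
import random


def normal_pdf(v, mu, var):
    """Density of a 1D Gaussian with mean mu and variance var."""
    return math.exp(-((v - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_first(v, w, mu, var):
    """Posterior probability that sample v belongs to component 0."""
    p0 = w[0] * normal_pdf(v, mu[0], var[0])
    p1 = w[1] * normal_pdf(v, mu[1], var[1])
    total = p0 + p1
    if total == 0.0:
        # Both densities underflowed: fall back to the nearer mean.
        return 1.0 if abs(v - mu[0]) <= abs(v - mu[1]) else 0.0
    return p0 / total


def fit_gmm_1d(x, iters=100):
    """Fit a two-component 1D Gaussian mixture with plain EM."""
    xs = sorted(x)
    n = len(xs)
    # Assumed initialization: means at the lower and upper quartiles.
    mu = [xs[n // 4], xs[(3 * n) // 4]]
    mean = sum(xs) / n
    spread = sum((v - mean) ** 2 for v in xs) / n + 1e-6
    var = [spread, spread]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of component 0 for every sample.
        r0 = [posterior_first(v, w, mu, var) for v in x]
        # M-step: update weight, mean, and variance of each component.
        for k, resp in enumerate((r0, [1.0 - r for r in r0])):
            nk = sum(resp) + 1e-12
            w[k] = nk / n
            mu[k] = sum(r * v for r, v in zip(resp, x)) / nk
            var[k] = sum(r * (v - mu[k]) ** 2 for r, v in zip(resp, x)) / nk + 1e-6
    return w, mu, var


def confidence_mask(losses, w, mu, var):
    """True where the low-loss (assumed in-distribution) component dominates."""
    low = 0 if mu[0] <= mu[1] else 1
    mask = []
    for v in losses:
        p0 = posterior_first(v, w, mu, var)
        p_low = p0 if low == 0 else 1.0 - p0
        mask.append(p_low > 0.5)
    return mask


# Synthetic losses (hypothetical values): a tight low-loss cluster standing in
# for in-distribution pixels and a broad high-loss cluster for anomalies.
rng = random.Random(0)
losses = [rng.gauss(0.1, 0.02) for _ in range(200)]
losses += [rng.gauss(0.8, 0.1) for _ in range(200)]

w, mu, var = fit_gmm_1d(losses)
mask = confidence_mask([0.1, 0.9], w, mu, var)
print("component means:", mu)
print("mask for losses 0.1 and 0.9:", mask)
```

In a real pipeline a library mixture model would likely replace the hand-written EM loop; the pure-Python version is used here only to keep the example free of dependencies.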

Abstract

Identifying the physical properties of the surrounding environment is essential for robotic locomotion and navigation to deal with non-geometric hazards, such as slippery and deformable terrains. It would be of great benefit for robots to anticipate these extreme physical properties before contact; however, estimating environmental physical parameters from vision is still an open challenge. Animals can achieve this by using their prior experience and knowledge of what they have seen and how it felt. In this work, we propose a cross-modal self-supervised learning framework for vision-based environmental physical parameter estimation, which paves the way for future physical-property-aware locomotion and navigation. We bridge the gap between existing policies trained in simulation and identification of physical terrain parameters from vision. We propose to train a physical decoder in simulation to predict friction and stiffness from multi-modal input. The trained network allows the labeling of real-world images with physical parameters in a self-supervised manner to further train a visual network during deployment, which can densely predict the friction and stiffness from image data. We validate our physical decoder in simulation and the real world using a quadruped ANYmal robot, outperforming an existing baseline method. We show that our visual network can predict the physical properties in indoor and outdoor experiments while allowing fast adaptation to new environments.

Paper Structure (24 sections, 3 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Overview of the two-stage self-supervised terrain physical parameter learning framework. A physical decoder with a twin structure is trained in simulation to predict simulated friction and stiffness parameters per foot. The physical decoder transfers to the real world, where it provides self-supervised labels (within the supervision mask) to train a visual network on real-world image data. During training, the visual network receives weak supervision only on the foothold pixels. During inference, the visual pipeline processes all pixel features within an image and outputs the corresponding dense prediction of the simulated physical parameters with a confidence mask.
  • Figure 2: Physical decoder architecture in the form of a twin network. Friction and stiffness are each predicted by a separate network. The yellow trapezoidal blocks are .
  • Figure 3: Visual network architecture and losses used for training. The decoder has an encoder-decoder structure for simultaneous detection and physical parameter regression. Friction and stiffness values are predicted at the same time for every pixel feature in the input image.
  • Figure 4: Evolution of the reconstruction loss distribution as the number of training steps increases. The distribution is unimodal before training and becomes bimodal during training. Yellow-orange indicates In-Distribution (ID) data, and purple-blue indicates Out-Of-Distribution (OOD) data.
  • Figure 5: Online training framework adapted from Frey2023. The inference task extracts features and outputs masked dense predictions. The self-supervision task contains the Mission Graph to store paired input features and labels provided by the physical decoder. The learning task performs continuous training of the visual network in the learning thread.
  • ...and 7 more figures
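
The weak supervision described in the Figure 1 caption, where only foothold pixels carry labels, can be sketched as a loss that ignores every unlabeled pixel. The function name, the nested-list image layout, the dictionary of foothold labels, and the choice of mean squared error are all assumptions for illustration; the losses actually used by the visual network are those shown in the paper's Figure 3 and are not reproduced here.

```python
def foothold_masked_mse(pred, labels):
    """Mean squared error over labeled (foothold) pixels only.

    pred:   H x W nested list of predicted values (for example friction).
    labels: dict mapping (row, col) foothold pixels to decoder-provided labels.
    Pixels without a label contribute nothing to the loss.
    """
    if not labels:
        return 0.0
    err = sum((pred[r][c] - y) ** 2 for (r, c), y in labels.items())
    return err / len(labels)


# Hypothetical 3 x 3 friction prediction with two labeled foothold pixels.
pred = [
    [0.5, 0.5, 0.5],
    [0.5, 0.7, 0.5],
    [0.5, 0.5, 0.9],
]
labels = {(1, 1): 0.6, (2, 2): 0.8}
loss = foothold_masked_mse(pred, labels)

# Changing an unlabeled pixel leaves the loss untouched.
pred_changed = [row[:] for row in pred]
pred_changed[0][0] = 0.0
loss_changed = foothold_masked_mse(pred_changed, labels)
print(loss, loss_changed)
```

Because unlabeled pixels never enter the sum, the dense prediction elsewhere in the image is shaped only indirectly, through features shared with the labeled footholds.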