You've Got to Feel It To Believe It: Multi-Modal Bayesian Inference for Semantic and Property Prediction

Parker Ewen; Hao Chen; Yuzhen Chen; Anran Li; Anup Bagali; Gitesh Gunjal; Ram Vasudevan

TL;DR

The paper addresses robust scene understanding under uncertainty by jointly estimating semantic labels and physical properties (e.g., friction) from vision and tactile sensing. It introduces a multi-modal Bayesian framework that uses conjugate priors, notably the Dirichlet and Dirichlet Normal-Inverse-Gamma (DNIG) priors, to enable closed-form online updates, together with a moment-matching approximation that keeps the Gaussian-mixture property estimates tractable. Key contributions include (i) online joint filtering of visual and tactile data for semantic and property estimation, (ii) an approximate posterior projection via the method of moments for a DNIG product, and (iii) hardware demonstrations (friction-aware gait switching, affordance-based door-opening) and an open-source C++/ROS implementation. The approach yields improved semantic accuracy and property estimation over vision-only baselines and supports risk-aware planning in challenging terrains, with potential for broader multi-modal active perception.
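To make the moment-matching idea concrete, the sketch below collapses a one-dimensional Gaussian mixture into a single Gaussian with the same mean and variance. This is the generic method of moments for a mixture, not the paper's DNIG projection; the function name `moment_match` and the example numbers are assumptions chosen for illustration.

```python
def moment_match(weights, means, variances):
    """Collapse a 1-D Gaussian mixture into one Gaussian by matching
    its first two moments (generic sketch, hypothetical helper)."""
    total = sum(weights)
    w = [x / total for x in weights]  # normalise the mixture weights
    # First moment: weighted average of the component means.
    mean = sum(wi * mi for wi, mi in zip(w, means))
    # Second raw moment: E[x^2] = sum_k w_k * (var_k + mean_k^2).
    second = sum(wi * (vi + mi ** 2) for wi, mi, vi in zip(w, means, variances))
    # Variance of the matched Gaussian: E[x^2] - E[x]^2.
    return mean, second - mean ** 2


# Two equally likely classes with different (made-up) friction means.
matched_mean, matched_var = moment_match([0.5, 0.5], [0.2, 0.8], [0.01, 0.01])
```

The matched variance is larger than either component variance because it also absorbs the spread between the component means.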

Abstract

Robots must be able to understand their surroundings to perform complex tasks in challenging environments, and many of these complex tasks require estimates of physical properties such as friction or weight. Estimating such properties using learning is challenging due to the large amounts of labelled data required for training and the difficulty of updating these learned models online at run time. To overcome these challenges, this paper introduces a novel, multi-modal approach for representing semantic predictions and physical property estimates jointly in a probabilistic manner. By using conjugate pairs, the proposed method enables closed-form Bayesian updates given visual and tactile measurements without requiring additional training data. The efficacy of the proposed algorithm is demonstrated through several hardware experiments. In particular, this paper illustrates that by conditioning semantic classifications on physical properties, the proposed method quantitatively outperforms state-of-the-art semantic classification methods that rely on vision alone. To further illustrate its utility, the proposed method is used in several applications, including the probabilistic representation of affordance-based properties and a challenging terrain traversal task using a legged robot. In the latter task, the proposed method represents the coefficient of friction of the terrain probabilistically, which enables the use of an online risk-aware planner that switches the legged robot from a dynamic gait to a static, stable gait when the expected value of the coefficient of friction falls below a given threshold. Videos of these case studies as well as the open-source C++ and ROS interface can be found at https://roahmlab.github.io/multimodal_mapping/.
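The gait-switching rule described in the abstract can be sketched as a threshold test on the expected coefficient of friction. The sketch assumes the expectation is a class-probability-weighted average of per-class friction means; the function names, the snow/ice friction values, and the threshold of 0.3 are all hypothetical and are not taken from the paper.

```python
def expected_friction(class_probs, class_friction_means):
    """Expected friction under a mixture over semantic classes
    (assumed form: sum_k P(class k) * mean friction of class k)."""
    return sum(p * m for p, m in zip(class_probs, class_friction_means))


def select_gait(class_probs, class_friction_means, threshold):
    """Return 'static' when the expected friction falls below the
    threshold, otherwise 'dynamic' (hypothetical planner rule)."""
    mu = expected_friction(class_probs, class_friction_means)
    return "static" if mu < threshold else "dynamic"


# Hypothetical per-class friction means for [snow, ice].
friction_means = [0.4, 0.1]
threshold = 0.3  # hypothetical switching threshold

# Vision alone leans towards snow; a tactile update shifts belief to ice.
gait_before = select_gait([0.8, 0.2], friction_means, threshold)
gait_after = select_gait([0.1, 0.9], friction_means, threshold)
```

With these made-up numbers the robot keeps its dynamic gait under the vision-only belief and switches to the static gait once the tactile measurement makes ice the likely class.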

Paper Structure (27 sections, 3 theorems, 24 equations, 10 figures, 2 tables, 2 algorithms)

Introduction
Related Works
Semantic Prediction
Property Estimation
Gap in the Literature
Preliminaries
Conjugate Pairs
Dirichlet Conjugate Prior
Dirichlet Normal-Inverse-Gamma Conjugate Prior
Method of Moments
Revisiting the Dirichlet Normal-Inverse-Gamma Product
Algorithm Overview
Vision-Based Estimation
Tactile-Based Estimation
Implementation
...and 12 more sections

Key Result

Theorem 1

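Let $\mathcal{Z} = \{z_1, \dots, z_n\}$ be a set of measurements drawn from a Categorical distribution, $p(z_i = j | \boldsymbol{\theta})$, and let the prior for $\boldsymbol{\theta}$ be a Dirichlet distribution, $p(\boldsymbol{\theta}|\boldsymbol{\alpha})$. The posterior computed using Bayes' theorem is also a Dirichlet distribution, $p(\boldsymbol{\theta}|\mathcal{Z}, \boldsymbol{\alpha}) = \text{Dir}(\boldsymbol{\theta}|\boldsymbol{\alpha}')$, where $\alpha'_j = \alpha_j + \sum_{i=1}^{n} 1\{z_i = j\}$ and $1\{z_i = j\}$ is equal to $1$ when the expected class of measurement $z_i$ is class $j$ and $0$ otherwise.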
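Theorem 1 concerns the Categorical-Dirichlet conjugate pair, whose standard closed-form update adds the per-class measurement counts to the prior concentration parameters. The following is a minimal sketch of that textbook update; the function names and the three-class example are assumptions for illustration and are not drawn from the paper's C++ implementation.

```python
from collections import Counter


def dirichlet_posterior(alpha, measurements):
    """Return posterior concentrations alpha'_j = alpha_j + count of class j.

    `measurements` is a list of integer class indices (hypothetical encoding).
    """
    counts = Counter(measurements)
    return [a + counts.get(j, 0) for j, a in enumerate(alpha)]


def expected_class_probabilities(alpha):
    """Mean of a Dirichlet distribution: alpha_j / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Uniform prior over three classes, then five class measurements.
alpha_prior = [1.0, 1.0, 1.0]
measurements = [0, 2, 2, 1, 2]
alpha_post = dirichlet_posterior(alpha_prior, measurements)
probs = expected_class_probabilities(alpha_post)
```

Because the update only increments counts, it can be applied one measurement at a time as new observations arrive, with no retraining.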

Figures (10)

Figure 1: The method proposed in this paper jointly estimates semantic classifications and physical properties by combining visual and tactile data into a single semantic mapping framework. RGB-D images are used to build a metric-semantic map that iteratively estimates semantic labels. A property measurement is taken which in turn updates both the semantic class predictions and physical property estimates. In the depicted example, the robot is unsure if the terrain in front of it is snow or ice from vision measurements alone (prior estimates) which dramatically affects the coefficient of friction and the associated gait that can be applied to safely traverse the terrain. The robot uses a tactile sensor attached to its manipulator to update its coefficient of friction estimation (posterior estimates), which then enables it to change gaits to cross the ice safely.
Figure 2: A flow diagram illustrating Algorithm \ref{alg:pred}. A semantic classification algorithm predicts pixel-wise classes from RGB images that are then projected into a common mapping frame using the aligned depth image, camera intrinsics, and estimated camera pose. This semantic point cloud is used to build a metric-semantic map. When a property measurement is taken, Algorithm \ref{alg:moments} is used to update the semantic and property estimates.
Figure 3: Results for a single simulated experiment. The image (a) and ground truth semantic labels (b) are from the Dense Material Segmentation Dataset \cite{upchurch2022materials}. The semantic segmentation predictions (c) do not classify parts of the desk as wood. A friction measurement is simulated using the ground-truth semantic label and Table \ref{table:friction} and is sampled from the pixel highlighted by the red cross in the RGB image. Algorithm \ref{alg:moments} then computes the correct posterior semantic label (d).
Figure 4: Results for a single simulated experiment in the Habitat simulator \cite{savva2019habitat}. a) The input RGB image with ground truth semantic labels is shown. On the right-hand side are the outputs of the Selmap \cite{ewen2022these} baseline and our proposed approach using three different semantic segmentation networks. b) The incorrect semantic labelling is reflected in the Selmap implementation as the semantic segmentation networks all predict incorrect classifications in various regions in the scene. By exploiting tactile measurements taken at the locations denoted by the red 'X', c) our proposed approach is able to correct these erroneous semantic predictions.
Figure 5: A semantic segmentation task is shown and the proposed method is compared against the semantic mapping approach from \cite{ewen2022these}, which is called Selmap. The expected semantic class is shown. a) The input image with measurement locations shown using an X and associated b) ground truth semantic labels are provided. Both Selmap and our approach use the same SegFormer + FastSAM pre-trained semantic segmentation network specified in Section \ref{sec:implementation}. This network incorrectly predicts the semantic labels for several regions within the scene. c) This incorrect labelling is reflected in the Selmap implementation as the visual-based semantic mapping approach is unable to correct these erroneous predictions. By exploiting a tactile sensing modality, d) our approach is able to correct the erroneous semantic predictions and correctly predict the semantic labels of the objects within the scene.
...and 5 more figures
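The Figure 2 caption describes projecting pixel-wise class predictions into a common mapping frame using the aligned depth image, camera intrinsics, and estimated camera pose. A minimal sketch of that step for a single pixel follows, assuming a standard pinhole camera model and a pose given as a rotation matrix and translation vector; the function names and all numeric values are hypothetical.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into the camera frame
    using the pinhole model (assumed camera model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def transform_point(R, t, p):
    """Map a camera-frame point into the map frame: R @ p + t."""
    return tuple(
        sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)
    )


# Hypothetical intrinsics and an identity-rotation camera pose.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 0.0, 0.0]

p_cam = backproject_pixel(420, 240, 2.0, fx, fy, cx, cy)
p_map = transform_point(R, t, p_cam)
# A semantic point pairs the map-frame position with the pixel's class label.
semantic_point = (p_map, "ice")
```

Repeating this for every labelled pixel yields the semantic point cloud from which the caption says the metric-semantic map is built.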

Theorems & Definitions (3)

Theorem 1
Theorem 2
Theorem 3
