The Neural Compass: Probabilistic Relative Feature Fields for Robotic Search

Gabriele Somaschini, Adrian Röfer, Abhinav Valada

TL;DR

This work proposes ProReFF, a feature field model trained to predict relative distributions of features obtained from pre-trained vision-language models. It also introduces a learning-based strategy that enables training from unlabeled and potentially contradictory data by aligning inconsistent observations into a coherent relative distribution.

Abstract

Object co-occurrences provide a key cue for finding objects successfully and efficiently in unfamiliar environments. Typically, one looks for cups in kitchens and views fridges as evidence of being in a kitchen. Such priors have also been exploited in artificial agents, but they are typically learned from explicitly labeled data or queried from language models. It is still unclear whether these relations can be learned implicitly from unlabeled observations alone. In this work, we address this problem and propose ProReFF, a feature field model trained to predict relative distributions of features obtained from pre-trained vision-language models. In addition, we introduce a learning-based strategy that enables training from unlabeled and potentially contradictory data by aligning inconsistent observations into a coherent relative distribution. For the downstream object search task, we propose an agent that leverages predicted feature distributions as a semantic prior to guide exploration toward regions with a high likelihood of containing the object. We present extensive evaluations that demonstrate that ProReFF captures meaningful relative feature distributions in natural scenes and that provide insight into the impact of our proposed alignment step. We further evaluate the performance of our search agent in 100 challenges in the Matterport3D simulator, comparing with feature-based baselines and human participants. The proposed agent is 20% more efficient than the strongest baseline and achieves up to 80% of human performance.
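
To make the "contradictory data" problem concrete, the following is a minimal sketch of why unaligned relative observations can disagree and how a rotation can reconcile them. The Figure 2 caption describes the alignment as a rotation $R$; everything else here is an assumption made for illustration: the 2D setting, the helper `yaw_rotation`, the fridge and sink layout, and the hard-coded angles `theta_a` and `theta_b`, which stand in for the output of the learned alignment model. It is not the authors' implementation.

```python
import numpy as np


def yaw_rotation(theta: float) -> np.ndarray:
    """Return the 2D rotation matrix for an angle theta in radians (hypothetical helper)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


# Two hypothetical scenes share the same query object (say, a fridge at the origin).
# In scene A a sink sits 1 m along +x from the fridge. Scene B has the same layout,
# but the whole room is rotated by 90 degrees, so the sink sits 1 m along +y.
offset_sink_a = np.array([1.0, 0.0])
offset_sink_b = yaw_rotation(np.pi / 2) @ offset_sink_a

# Without alignment, the offset v = [1, 0] points at a sink in scene A but at
# something else in scene B: the same (query, offset) pair has two different targets.
unaligned_agree = np.allclose(offset_sink_a, offset_sink_b)

# Assumed per-observation alignment angles. In the paper these would come from the
# learned alignment model; here they are hard-coded for the illustration.
theta_a, theta_b = 0.0, -np.pi / 2
aligned_a = yaw_rotation(theta_a) @ offset_sink_a
aligned_b = yaw_rotation(theta_b) @ offset_sink_b

# After rotating both observations into a shared frame, the offsets coincide.
aligned_agree = np.allclose(aligned_a, aligned_b)
```

In this toy setup `unaligned_agree` is false and `aligned_agree` is true, which is the effect the alignment step is meant to achieve on real feature point clouds.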

Paper Structure (13 sections, 18 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Overview of ProReFF. We train a relative feature field, which predicts a mean embedding and a variance, given a query embedding $q$ and a relative offset $v$. This network is trained in an unsupervised manner from feature point cloud observations. To enable this training, we introduce a learned data alignment model. We demonstrate the utility of ProReFF for object search by using it to infer distributions of features around a target object. These distributions are compared with the agent's current observations, and a choice is made between following the current observations or inferring further features.
  • Figure 2: Depiction of the ambiguity problem in the data. Observing the same query feature (red dot) from two different locations can lead to contradictory target features (pink dot) for the same offset vector. The schematic illustrates the effect of the learned alignment on this problem: in two separate observation instances, the alignment $R$ rotates the observations such that the contradiction is resolved.
  • Figure 3: UMAP visualization of ground-truth cluster embeddings (left), base model predictions (center), and aligned model predictions (right). Cluster centroids are marked as circled points. The base model exhibits severe mode collapse, whereas the aligned model retains substantially more of the semantic diversity around the query embedding. This demonstrates the effectiveness of the Alignment Network at resolving conflicts in the training data.
  • Figure 4: Distribution of pointwise cosine similarities between ground-truth and predicted embeddings. base: predictor trained without the Alignment Network. aligned: predictor trained with the Alignment Network, tested with positions rotated into the canonical frame. aligned original: same aligned model tested with unrotated scene positions, showing the frame dependency of the learned predictions.
  • Figure 5: Cluster-based distance between ground-truth and predicted embedding distributions. Lower values indicate better preservation of semantic structure. base: predictor without alignment. aligned: predictor with the Alignment Network using rotated input positions. aligned original: same model using unrotated scene positions. aligned sphere: same model with uniformly sampled spherical positions. Baselines: random (random embeddings) and mean (all embeddings set to the scene mean).
  • ...and 2 more figures
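
The pointwise cosine similarity named in the Figure 4 caption can be sketched as follows. This is an illustrative computation on synthetic data, not the authors' evaluation code: the function name `pointwise_cosine_similarity`, the embedding dimension, and the sample count are assumptions. The random and mean baselines from the Figure 5 caption are reused here only to show how such reference predictions might be constructed.

```python
import numpy as np


def pointwise_cosine_similarity(gt: np.ndarray, pred: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between matching rows of gt and pred, both of shape (N, D)."""
    gt_unit = gt / np.maximum(np.linalg.norm(gt, axis=1, keepdims=True), eps)
    pred_unit = pred / np.maximum(np.linalg.norm(pred, axis=1, keepdims=True), eps)
    return np.sum(gt_unit * pred_unit, axis=1)


rng = np.random.default_rng(0)

# Synthetic stand-in for ground-truth embeddings (assumed: 200 points, 16 dimensions).
gt = rng.normal(size=(200, 16))

# Reference predictions: random embeddings, and every embedding set to the scene mean.
random_baseline = rng.normal(size=gt.shape)
mean_baseline = np.broadcast_to(gt.mean(axis=0), gt.shape)

sims_random = pointwise_cosine_similarity(gt, random_baseline)
sims_mean = pointwise_cosine_similarity(gt, mean_baseline)
sims_perfect = pointwise_cosine_similarity(gt, gt.copy())  # upper bound: prediction equals ground truth
```

Each result is one similarity value per point, so plotting a histogram of `sims_random`, `sims_mean`, and a model's similarities would give a distribution comparison of the kind the caption describes.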