Seeing with Partial Certainty: Conformal Prediction for Robotic Scene Recognition in Built Environments

Yifan Xu, Vineet Kamat, Carol Menassa

TL;DR

This work tackles the problem of reliable, open-vocabulary place recognition for assistive robots in indoor built environments, where language-based interfaces can hallucinate and human instructions may be ambiguous. It introduces Seeing with Partial Certainty (SwPC), which applies conformal prediction to produce prediction sets with a coverage guarantee $P(Y_{test} \in \mathcal{C}(X_{test})) \ge 1-\alpha$, using non-conformity scores $s_i = 1 - f(X_i, Y_i)$ and a calibrated threshold $\hat{q}$. Theoretical backing is provided by Theorem 1, ensuring the conformal-calibrated sets maintain the desired confidence level, while the framework remains model-agnostic and does not require fine-tuning of the VLMs. Empirical results on the Matterport3D dataset demonstrate that SwPC increases success rates and reduces the need for human intervention by offering flexible prediction-set sizes that adapt to the calibrated uncertainty, supporting safer and more efficient indoor navigation for people with disabilities.
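The calibration step summarized above can be illustrated with a minimal split-conformal sketch. It is not the authors' implementation: the function names `calibrate_threshold` and `prediction_set`, the toy calibration scores, and the example label probabilities are all hypothetical, and it assumes the standard finite-sample quantile rule in which $\hat{q}$ is the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score.

```python
import math

def calibrate_threshold(cal_scores, alpha):
    """Return q_hat, the ceil((n+1)(1-alpha))-th smallest non-conformity score.

    cal_scores holds s_i = 1 - f(X_i, Y_i) for the calibration examples,
    where f(X_i, Y_i) is the model's confidence in the true label.
    """
    n = len(cal_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every label.
        return math.inf
    return sorted(cal_scores)[k - 1]

def prediction_set(label_probs, q_hat):
    """Keep every label whose non-conformity score 1 - f(x, y) is <= q_hat."""
    return {label for label, p in label_probs.items() if 1.0 - p <= q_hat}

# Hypothetical calibration scores (already computed as 1 - confidence).
cal_scores = [0.10, 0.20, 0.25, 0.40, 0.05, 0.60, 0.15, 0.30, 0.45]
q_hat = calibrate_threshold(cal_scores, alpha=0.1)

# Hypothetical VLM confidences for one test image.
test_probs = {"kitchen": 0.45, "dining room": 0.42, "living room": 0.13}
room_set = prediction_set(test_probs, q_hat)
```

With these toy numbers the two closely scored labels both fall inside the set, which is the behavior the coverage guarantee relies on: the set grows when the model cannot separate candidates and shrinks when one label dominates.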

Abstract

In assistive robotics serving people with disabilities (PWD), accurate place recognition in built environments is crucial to ensure that robots navigate and interact safely within diverse indoor spaces. Language interfaces, particularly those powered by Large Language Models (LLMs) and Vision Language Models (VLMs), hold significant promise in this context, as they can interpret visual scenes and correlate them with semantic information. However, such interfaces are also known to produce hallucinated predictions. In addition, language instructions provided by humans can be ambiguous and lack precise details about specific locations, objects, or actions, which exacerbates the hallucination issue. In this work, we introduce Seeing with Partial Certainty (SwPC), a framework designed to measure and align uncertainty in VLM-based place recognition, enabling the model to recognize when it lacks confidence and seek assistance when necessary. The framework builds on the theory of conformal prediction to provide statistical guarantees on place recognition while minimizing requests for human help in complex indoor environments. Through experiments on the widely used, richly annotated scene dataset Matterport3D, we show that SwPC significantly increases the success rate and decreases the amount of human intervention required relative to the prior art. SwPC can be used directly with any VLM without model fine-tuning, offering a promising, lightweight approach to uncertainty modeling that complements and scales alongside the expanding capabilities of foundation models.
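One way to picture the "seek assistance when necessary" behavior is a rule that acts autonomously only when the prediction set contains exactly one label. The sketch below is an assumption for illustration, not the rule stated in the paper: `decide` and `summarize` are hypothetical helpers, the episodes are made-up data, and success is counted under the assumption that a human, when asked, picks the true label whenever it is in the set.

```python
from typing import List, Optional, Set, Tuple

def decide(pred_set: Set[str]) -> Tuple[str, Optional[str]]:
    """Act on a singleton set; otherwise (empty or ambiguous) ask for help."""
    if len(pred_set) == 1:
        return ("act", next(iter(pred_set)))
    return ("ask_human", None)

def summarize(episodes: List[Tuple[Set[str], str]]) -> Tuple[float, float]:
    """Return (help_rate, success_rate) over (prediction_set, true_label) pairs.

    Assumption: an episode succeeds when the true label is in the set,
    either because the robot acted on a correct singleton or because a
    human chose the correct option from a larger set.
    """
    n = len(episodes)
    helped = sum(1 for pred_set, _ in episodes if decide(pred_set)[0] == "ask_human")
    succeeded = sum(1 for pred_set, truth in episodes if truth in pred_set)
    return helped / n, succeeded / n

# Hypothetical evaluation episodes.
episodes = [
    ({"kitchen"}, "kitchen"),            # confident and correct
    ({"bedroom", "office"}, "office"),   # ambiguous, human resolves it
    ({"hallway"}, "stairs"),             # confident but wrong
    ({"bathroom"}, "bathroom"),          # confident and correct
]
help_rate, success_rate = summarize(episodes)
```

Under this rule, lowering $\alpha$ tends to enlarge the prediction sets, which raises the success rate at the cost of more help requests; that trade-off is what a success-versus-help-rate comparison measures.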

Paper Structure

This paper contains 14 sections, 7 equations, 4 figures.

Figures (4)

  • Figure 1: Illustration of conformal prediction (CP).
  • Figure 2: Illustration of SwPC pipeline.
  • Figure 3: Comparison of task success rate vs. average prediction set size (left) and vs. human help rate (right) on the Matterport3D dataset, averaged over the three settings. 1504 rooms are evaluated for each method. $\alpha$ is varied from 0 to 1 for CP. Binary and No Help are not shown on the left since they do not provide prediction sets.
  • Figure 4: Qualitative comparison between Prompt Set and CP Set on Matterport3D. The overlap between the ground truth and the CP prediction is larger than that of the baseline Prompt Set. Binary and No Help Set are not shown on the left since prediction sets are not provided.