When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

Jessie Yuan; Yilin Wu; Andrea Bajcsy

Abstract

Policy steering is an emerging way to adapt robot behaviors at deployment time: a learned verifier analyzes low-level action samples proposed by a pre-trained policy (e.g., a diffusion policy) and selects only those aligned with the task. While Vision-Language Models (VLMs) are promising general-purpose verifiers due to their reasoning capabilities, existing frameworks often assume these models are well-calibrated. In practice, overconfident VLM judgments can degrade steering performance under both high-level semantic uncertainty in task specifications and low-level action uncertainty or incapability of the pre-trained policy. We propose uncertainty-aware policy steering (UPS), a framework that jointly reasons about semantic task uncertainty and low-level action feasibility and selects an uncertainty resolution strategy: execute a high-confidence action, clarify task ambiguity via natural language queries, or ask for action interventions to correct the low-level policy when it is deemed incapable of the task. We leverage conformal prediction to calibrate the composition of the VLM and the pre-trained base policy, providing statistical assurances that the verifier selects the correct strategy. After collecting interventions during deployment, we employ residual learning to improve the capability of the pre-trained policy, enabling the system to learn continually with minimal expensive human feedback. We demonstrate our framework through experiments in simulation and on hardware, showing that UPS can disentangle confident, ambiguous, and incapable scenarios and minimize expensive user interventions compared to uncalibrated baselines and prior human- or robot-gated continual learning approaches. Videos can be found at https://jessie-yuan.github.io/ups/
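The three-way choice described in the abstract (act, ask, or learn) can be illustrated with a minimal sketch. The mapping below is an assumption made for illustration, not the authors' stated rule: it supposes the calibrated verifier returns a prediction set of option labels, and that one hypothetical label, `NONE_OPTION`, means "none of the sampled action plans accomplishes the task". The names `Strategy`, `NONE_OPTION`, and `select_strategy` are all hypothetical.

```python
from enum import Enum


class Strategy(Enum):
    ACT = "execute"                # run the single high-confidence action plan
    ASK = "clarify"                # ask the user a natural-language question
    LEARN = "request_intervention" # ask for action corrections to the policy


# Hypothetical label meaning "no sampled action plan succeeds at the task".
NONE_OPTION = "none"


def select_strategy(pred_set):
    """Map a verifier prediction set to a resolution strategy (assumed rule)."""
    options = set(pred_set)
    # Assumption: an empty set or a set holding only the "none" label
    # signals that the base policy is incapable of the task.
    if not options or options == {NONE_OPTION}:
        return Strategy.LEARN
    # Assumption: exactly one feasible plan and no "none" label means the
    # verifier is confident enough to execute.
    if len(options) == 1:
        return Strategy.ACT
    # Anything else is treated as ambiguity that a clarification can resolve.
    return Strategy.ASK
```

Under these assumptions, `select_strategy(["plan_a"])` returns `Strategy.ACT`, while a set containing several plans returns `Strategy.ASK`.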

Paper Structure (26 sections, 1 theorem, 24 equations, 34 figures, 7 tables)

Introduction
Related Work
Problem Formulation
Approach: Uncertainty-Aware Policy Steering
VLM-in-the-loop Policy Steering
VLM Verifier Uncertainty Quantification
Resolving Uncertainty: From High-level Clarifications to Low-Level Continual Learning
Simulation & Hardware Experiments
Uncertainty-Aware Steering
Our Score Function Balances Coverage & Clarification
Uncertainty Improves Policy Steering Performance
Continual Learning
Semantic-level UQ Minimizes Human Feedback
UPS Improves Re-Deployment Performance
Conclusion & Limitations
...and 11 more sections

Key Result

Theorem 1

Let $D_{\text{calib}}= \{ \{(x_i^n, \mathcal{Y}_i^n)\}_{i=0}^M \}_{n=1}^N$ be the calibration dataset with non-conformity scores $\{\kappa^n\}^N_{n=1}$, and let $\hat{q}$ be the $\frac{\lceil (N+1)(1-\varepsilon)\rceil}{N}$-quantile of $\{\kappa^n\}^N_{n=1}$. For a test point $x_i^{\text{test}}$ at any step $i \in [M]$ at which the VLM verifier is called, where the correct set of labels is $\mathcal{Y}^*_i \subseteq\mathca
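To make the quantile in Theorem 1 concrete, here is a minimal sketch of the calibration step in Python. It takes the calibration scores $\kappa^n$ as given, since their definition is not shown in this excerpt. The prediction-set rule (keep every label whose score, taken here as one minus the verifier's probability, is at most $\hat{q}$) is the standard split-conformal construction and is an assumption about how the threshold is applied. The function names are hypothetical.

```python
import math

import numpy as np


def conformal_quantile(scores, epsilon):
    """Return the ceil((N+1)(1-epsilon))/N empirical quantile of the scores."""
    sorted_scores = np.sort(np.asarray(scores, dtype=float))
    n = sorted_scores.size
    # 1-indexed rank of the order statistic that realizes the quantile.
    rank = math.ceil((n + 1) * (1 - epsilon))
    if rank > n:
        # Too few calibration points for this epsilon: no finite threshold
        # gives the guarantee, so every label must be kept.
        return math.inf
    return float(sorted_scores[rank - 1])


def prediction_set(label_probs, q_hat):
    """Keep each label whose assumed score 1 - p is at most q_hat."""
    return [label for label, p in label_probs.items() if 1.0 - p <= q_hat]
```

With $N = 20$ calibration scores and $\varepsilon = 0.15$, the rank is $\lceil 21 \times 0.85 \rceil = 18$, so $\hat{q}$ is the 18th smallest score. A larger $\hat{q}$ admits more labels into the set, which is what trades coverage against how often the verifier must ask for clarification.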

Figures (34)

Figure 1: Uncertainty-Aware Policy Steering. Our framework calibrates the VLM verifier used for policy steering via conformal prediction. This enables the VLM to select an appropriate way to resolve uncertainty, from querying the end-user in natural language to asking to re-train the low-level control policy.
Figure 2: Outcome Prediction & Narration. The policy and the world model are interleaved to predict long-horizon outcomes induced by the low-level policy. Decoded observations are fed into a VLM which narrates the outcomes in text.
Figure 3: Uncertainty Quantification Results: Hardware. We compare combinations of Vanilla, CoT, and Bayesian Intent (Ours) models for UQ. Dashed lines mark either the target coverage rate ($1-\epsilon = 0.85$), clarification rate, or set size.
Figure 4: Uncertainty Quantification Results: Simulation. We compare combinations of Vanilla, CoT, and Bayesian Intent (Ours) models for UQ. Dashed lines mark either the target coverage rate ($1-\epsilon = 0.85$), clarification rate, or set size.
Figure 5: Success Rates Pre- and Post-Continual Learning: Hardware and Simulation. We deploy the robot with 20 straightforward (left) and 20 ambiguous (right) task instructions. We average the success rate over 20 trials for each scenario. Our approach solicits data in a way that maximizes the final success rate after residual policy training, compared to human- and robot-gated baselines.
...and 29 more figures

Theorems & Definitions (2)

Theorem 1: Verification Coverage Guarantee
proof
