Deployment-Time Reliability of Learned Robot Policies

Christopher Agia

Abstract

Recent advances in learning-based robot manipulation have produced policies with remarkable capabilities. Yet, reliability at deployment remains a fundamental barrier to real-world use, where distribution shift, compounding errors, and complex task dependencies collectively undermine system performance. This dissertation investigates how the reliability of learned robot policies can be improved at deployment time through mechanisms that operate around them. We develop three complementary classes of deployment-time mechanisms. First, we introduce runtime monitoring methods that detect impending failures by identifying inconsistencies in closed-loop policy behavior and deviations in task progress, without requiring failure data or task-specific supervision. Second, we propose a data-centric framework for policy interpretability that traces deployment-time successes and failures to influential training demonstrations using influence functions, enabling principled diagnosis and dataset curation. Third, we address reliable long-horizon task execution by formulating policy coordination as the problem of estimating and maximizing the success probability of behavior sequences, and we extend this formulation to open-ended, language-specified tasks through feasibility-aware task planning. By centering on core challenges of deployment, these contributions advance practical foundations for the reliable, real-world use of learned robot policies. Continued progress on these foundations will be essential for enabling trustworthy and scalable robot autonomy in the future.

Paper Structure

This paper contains 184 sections, 5 theorems, 66 equations, 37 figures, 10 tables, and 4 algorithms.

Introduction
Background
Thesis Outline
Policy Monitoring and Interpretability (Part I)
Policy Coordination and Planning (Part II)
Publications
Policy Monitoring and Interpretability
Monitoring Policies for Runtime Failure Detection
Introduction
Related Work
Problem Setup
Failure Detection
Policy Formulation
Proposed Approach: Sentinel
STAC: Detecting Erratic Failures with Temporal Consistency
...and 169 more sections

Key Result

Proposition 1

Let $P_\tau$ denote the distribution of success trajectories in the validation dataset $\mathcal{D}_\tau = \{\tau^i\}_{i=1}^M \overset{\textup{iid}}{\sim} P_\tau$. Then, the FPR---the probability of raising a false alarm at any point during an i.i.d. test trajectory $\tau \sim P_\tau$ of length $H'

Figures (37)

Figure 1: We present Sentinel, a runtime monitor that detects unknown failures of generative robot policies at deployment time. Constructing Sentinel requires only a set of successful policy rollouts and a description of the task, from which it detects diverse failures by monitoring (a) the temporal consistency of action-chunk distributions generated by the policy and (b) the task progress of the robot(s) through video QA with Vision-Language Models. More details can be found on the Sentinel website: https://sites.google.com/stanford.edu/sentinel.
Figure 2: Action sequence prediction overlap during policy rollout.
Figure 3: Overview of Sentinel. The images depict a policy rollout for timesteps $t=1,\ldots, T$. Temporal Consistency Detector: At each timestep $t$, the state $s_t$ is passed to the generative policy to obtain action distributions $\pi_t$, between which statistical distances $\hat{D}_t$ are computed to measure temporal consistency. The statistical distances are summed up to the current timestep $T$ (as in Eq. \ref{eq:sentinel-cum-score-fn}) and thresholded by $\gamma$ to detect policy failure. Vision-Language Model (VLM) Detector: The VLM classifies whether the policy is failing to make progress on its task, given a video up to timestep $T$ and a description of the task. Execution stops if either detector raises a warning.
Figure 4: Temporal consistency scores grow faster when the policy fails. Error bars indicate the 5th and 95th score quantiles.
Figure 5: Detecting failures in PushT. Left: Our failure detector (STAC), which measures the temporal consistency of a generative policy, outperforms several families of out-of-distribution detectors. Right: The best performance comes from measuring temporal consistency with statistical distance functions; augmenting baselines with temporal consistency does not always increase their performance.
...and 32 more figures
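
The detection rule described in the Figure 3 caption (sum the per-timestep statistical distances, compare the running sum against a threshold $\gamma$, and stop if either detector raises a warning) can be illustrated with a minimal sketch. This is not the dissertation's implementation. The function names `calibrate_threshold` and `first_alarm`, the parameter `delta`, and the quantile-based choice of $\gamma$ from successful rollouts are assumptions made for illustration, loosely following standard split-conformal calibration. The sketch also assumes the distances are nonnegative, so the final cumulative score of a rollout is also its largest.

```python
import math
from typing import List, Optional, Sequence


def calibrate_threshold(success_scores: Sequence[Sequence[float]], delta: float) -> float:
    """Choose gamma from successful rollouts only (hypothetical helper).

    Each inner sequence holds the per-timestep distances of one successful
    rollout. Gamma is set to the ceil((M + 1) * (1 - delta))-th smallest
    final cumulative score, a conformal-style quantile (an assumption here).
    """
    totals: List[float] = sorted(sum(scores) for scores in success_scores)
    m = len(totals)
    rank = math.ceil((m + 1) * (1 - delta))  # 1-indexed rank into the sorted totals
    if rank > m:
        # Too few calibration rollouts for this delta: never raise an alarm.
        return math.inf
    return totals[rank - 1]


def first_alarm(
    distances: Sequence[float],
    vlm_flags: Sequence[bool],
    gamma: float,
) -> Optional[int]:
    """Return the first timestep (1-indexed) at which either detector fires.

    distances: per-timestep temporal-consistency distances (nonnegative).
    vlm_flags: per-timestep outputs of a progress check (True = failing).
    Returns None if no alarm is raised during the rollout.
    """
    cumulative = 0.0
    for t, (distance, vlm_failing) in enumerate(zip(distances, vlm_flags), start=1):
        cumulative += distance
        if cumulative > gamma or vlm_failing:
            return t
    return None
```

As a usage example under these assumptions, one would call `calibrate_threshold` once on a held-out set of successful rollouts and then call `first_alarm` online, with the progress check supplied by whatever video QA model is available. Combining the two detectors with a logical OR mirrors the caption's statement that execution stops if either raises a warning.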

Theorems & Definitions (10)

Proposition 1: STAC has low FPR
Definition 1: Performance Influence
Definition 2: Action Influence
Proposition 2
Theorem 1: Adapted from Thm. D.1 in angelopoulos2021gentle
Proposition 3: STAC has low FPR
proof
proof
Proposition 4
proof
