Foundational World Models Accurately Detect Bimanual Manipulator Failures

Isaac R. Ward; Michelle Ho; Houjun Liu; Aaron Feldman; Joseph Vincent; Liam Kruse; Sean Cheong; Duncan Eddy; Mykel J. Kochenderfer; Mac Schwager

TL;DR

This work trains a probabilistic, history-informed world model within the compressed latent space of a pretrained vision foundation model (NVIDIA's Cosmos Tokenizer). The model outputs uncertainty estimates alongside its predictions, and these estimates serve as non-conformity scores within a conformal prediction framework.

Abstract

Deploying visuomotor robots at scale is challenging because anomalous failures can degrade performance, cause damage, or endanger human life. Bimanual manipulators are no exception; these robots have vast state spaces composed of high-dimensional images and proprioceptive signals. Explicitly defining failure modes within such state spaces is infeasible. In this work, we overcome these challenges by training a probabilistic, history-informed world model within the compressed latent space of a pretrained vision foundation model (NVIDIA's Cosmos Tokenizer). The model outputs uncertainty estimates alongside its predictions, and these estimates serve as non-conformity scores within a conformal prediction framework. We use these scores to develop a runtime monitor, correlating periods of high uncertainty with anomalous failures. To test these methods, we use the simulated Push-T environment and the Bimanual Cable Manipulation dataset, the latter of which we introduce in this work. This new dataset features trajectories with multiple synchronized camera views, proprioceptive signals, and annotated failures from a challenging data center maintenance task. We benchmark our methods against baselines from the anomaly detection and out-of-distribution detection literature, and show that our approach considerably outperforms statistical techniques. Furthermore, we show that our approach requires approximately one twentieth of the trainable parameters of the next-best learning-based approach, yet outperforms it by 3.8% in terms of failure detection rate, paving the way toward safely deploying manipulator robots in real-world environments where reliability is non-negotiable.
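
To make the runtime monitor concrete, the following is a minimal sketch of split conformal calibration over scalar non-conformity scores. It is an illustration under stated assumptions, not the authors' implementation: the function names `conformal_threshold` and `flag_anomalies`, the choice of `alpha = 0.1` (matching a 90% threshold), and all numeric scores are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal threshold at miscoverage level alpha.

    calibration_scores: non-conformity scores computed on held-out
    nominal (failure-free) data. Assumed to be plain floats here.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: nothing can be flagged.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def flag_anomalies(scores, threshold):
    """Flag each timestep whose score exceeds the calibrated threshold."""
    return [s > threshold for s in scores]


# Hypothetical uncertainty scores from nominal calibration trajectories.
calibration = [0.10, 0.12, 0.08, 0.11, 0.09, 0.13, 0.10, 0.12, 0.11, 0.14,
               0.09, 0.10, 0.12, 0.11, 0.10, 0.13, 0.09, 0.11, 0.12, 0.10]
threshold = conformal_threshold(calibration, alpha=0.1)

# Hypothetical scores observed at runtime; the middle two are elevated.
runtime_scores = [0.10, 0.11, 0.35, 0.40, 0.12]
flags = flag_anomalies(runtime_scores, threshold)
print(threshold, flags)
```

Under the usual exchangeability assumption, a nominal score exceeds this threshold with probability at most `alpha`, so sustained exceedances are a reasonable signal that the robot has left the nominal regime.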

Paper Structure (26 sections, 9 equations, 7 figures, 1 table)

Introduction
Related Work
Classical and Statistical Methods
Autoencoder and Embedding-Based Methods
Distribution Models
Modern methods
Problem Formulation
Methodology
Training the World Model
Non-conformity scores
WM uncertainty
WM prediction error
logpZO [xu2025can]
AE reconstruction error [japkowicz1995novelty, kirichenko2020why]
AE sim($z, z_{\textnormal{safe}}$) [sinha2024real]
...and 11 more sections

Figures (7)

Figure 1: A schematic of the training process for the proposed method. A trajectory $\tau=(s_1, a_1, s_2, a_2, \dots)$ is sampled from the dataset and sliced into training samples consisting of a fixed history window $\mathbf{h}_t = (s_{t-H+1}, a_{t-H+1}, \dots, s_t, a_t)$ and future state $s_{t+1}$. The multi-step, multi-view images captured at the robot's grippers ($\hat{s}_{t}^{(1)}$ and $\hat{s}_{t}^{(2)}$) are encoded using a foundation video encoder (Cosmos) and, alongside learned projections of the historical proprioceptive states $s_t^\theta$ and actions $a_t$, are combined into latent feature maps. These are processed sequentially by a transformer to predict a tensor of distributions, the standard deviations of which quantify the uncertainty of the future prediction. This uncertainty is lower for nominal inputs (like those observed during training), and higher for anomalous inputs associated with failures. In other words: $\texttt{WM}(\mathbf{h}_t) \approx f(\mathbf{h}_t)$ only when $\mathbf{h}_t\in\mathcal{M}_{\textnormal{nom}}$. Sampling from this latent distribution gives a latent feature map $Z\sim\mathcal{N}(\mu, \sigma^2)$, which can be decoded and refined into predictions for the next immediate state ($\hat{s}_{t+1}^{(1)}$, $\hat{s}_{t+1}^{(2)}$, $\hat{s}_{t+1}^\theta$).
Figure 2: Overview of datasets used in this work: (a) Push-T dataset, (b) Bimanual Cable Manipulation dataset.
Figure 3: Histograms of the WM uncertainty scores for nominal (blue), anomalous recolored, and anomalous dynamically altered inputs, with overlaid visualizations of what these anomalous inputs look like. In each example the $90\%$ conformal prediction threshold is rendered in red. To test our method's ability to handle visual and proprioceptive anomalies, we alter the color to orange (top left) and green (bottom left) as well as altering the friction to half the nominal friction coefficient (top right) and zero (bottom right).
Figure 4: The WR1 Data Center Mobile Manipulator robot.
Figure 5: We find that our WM-based approaches outperform competing methods on the Bimanual Cable Manipulation dataset, presumably because of how they uniquely incorporate historical context. See Table 1 for a class-based breakdown.
...and 2 more figures
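
The trajectory slicing described in the Figure 1 caption can be sketched as follows. This is a hedged illustration only: `slice_trajectory` and `uncertainty_score` are hypothetical helper names, states and actions are stand-in strings, and averaging the predicted standard deviations into one scalar is an assumption, since the caption does not state how they are aggregated.

```python
def slice_trajectory(states, actions, history_len):
    """Slice one trajectory into (history, next_state) training samples.

    Each history is a list of (state, action) pairs of length H ending at
    step t, and the target is the state at step t + 1.
    """
    if history_len < 1:
        raise ValueError("history_len must be at least 1")
    samples = []
    # t is the 0-based index of the last step in the window; t + 1 must exist.
    for t in range(history_len - 1, len(states) - 1):
        start = t - history_len + 1
        history = list(zip(states[start:t + 1], actions[start:t + 1]))
        samples.append((history, states[t + 1]))
    return samples


def uncertainty_score(sigmas):
    """Collapse predicted standard deviations into one scalar (assumed: mean)."""
    if not sigmas:
        raise ValueError("need at least one standard deviation")
    return sum(sigmas) / len(sigmas)


# Stand-in trajectory with five states and four actions, window length H = 2.
states = ["s1", "s2", "s3", "s4", "s5"]
actions = ["a1", "a2", "a3", "a4"]
samples = slice_trajectory(states, actions, history_len=2)
for history, target in samples:
    print(history, "->", target)

# Hypothetical per-element standard deviations for one prediction.
score = uncertainty_score([0.5, 1.0, 1.5])
```

A scalar like `score` is the kind of quantity that could then be compared against a calibrated conformal threshold at each timestep.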
