From Perception Logs to Failure Modes: Language-Driven Semantic Clustering of Failures for Robot Safety

Aryaman Gupta, Yusuf Umut Ciftci, Somil Bansal

TL;DR

The paper addresses safety-critical robotic failures that are diverse and long-tail, making manual analysis impractical. It proposes a closed-loop, language-driven framework that uses Multimodal Large Language Models to infer failure causes from perception sequences, cluster them into $L$ semantic failure modes via a prompt ensemble, and assign trajectories to clusters for downstream use. Contributions include unsupervised clustering of $N$ failure sequences into human-readable categories with natural-language summaries and keywords, demonstration on RoboFail, Nexar dashcam data, and indoor navigation, and demonstrations of online failure monitoring and targeted data collection for policy refinement. The approach enables scalable, interpretable learning from real-world failures, providing early-warning signals and guiding robust policy improvement in safety-critical robotic systems.
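The infer, cluster, and assign stages described above can be illustrated with a toy sketch. This is not the authors' implementation: the MLLM call is replaced by a hypothetical stub (`stub_infer_cause`), the failure modes and keywords in `FAILURE_MODES` are invented for illustration rather than discovered by a prompt ensemble, and assignment uses a simple keyword-overlap heuristic in place of language-model reasoning.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical stand-in for an MLLM call: maps a perception sequence
# (here, a list of textual frame descriptions) to a free-text failure cause.
def stub_infer_cause(frames: List[str]) -> str:
    return " ".join(frames).lower()

# Invented failure modes with keywords. In the paper these are discovered
# by the MLLM; they are hard-coded here purely for illustration.
FAILURE_MODES: Dict[str, List[str]] = {
    "object dropped": ["drop", "dropped", "slip", "fell"],
    "collision": ["collide", "collision", "hit", "bump"],
}

def assign_mode(cause: str, modes: Dict[str, List[str]]) -> str:
    """Assign a cause to the mode with the most keyword hits (swapped-in heuristic)."""
    tokens = cause.split()
    scores = {name: sum(tokens.count(k) for k in kws) for name, kws in modes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unassigned"

def cluster_failures(
    logs: List[List[str]],
    infer: Callable[[List[str]], str] = stub_infer_cause,
    modes: Dict[str, List[str]] = FAILURE_MODES,
) -> Counter:
    """Count how many failure sequences fall into each failure mode."""
    return Counter(assign_mode(infer(frames), modes) for frames in logs)

if __name__ == "__main__":
    logs = [
        ["robot carries pot", "pot dropped on floor"],
        ["robot hit the wall"],
        ["robot idle"],
    ]
    print(cluster_failures(logs))
```

Passing `infer` as a parameter keeps the sketch independent of any particular model API; a real system would substitute a function that queries an MLLM with the perception sequence.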

Abstract

As robotic systems become increasingly integrated into real-world environments -- ranging from autonomous vehicles to household assistants -- they inevitably encounter diverse and unstructured scenarios that lead to failures. While such failures pose safety and reliability challenges, they also provide rich perceptual data for improving future performance. However, manually analyzing large-scale failure datasets is impractical. In this work, we present a method for automatically organizing large-scale robotic failure data into semantically meaningful failure clusters, enabling scalable learning from failure without human supervision. Our approach leverages the reasoning capabilities of Multimodal Large Language Models (MLLMs), trained on internet-scale data, to infer high-level failure causes from raw perceptual trajectories and discover interpretable structure within uncurated failure logs. These semantic clusters reveal patterns and hypothesized causes of failure, enabling scalable learning from experience. We demonstrate that the discovered failure modes can guide targeted data collection for policy refinement, accelerating iterative improvement in agent policies and overall safety. Additionally, we show that these semantic clusters can benefit online failure monitoring systems, offering a lightweight yet powerful safeguard for real-time operation. We demonstrate that this framework enhances robot learning and robustness by transforming real-world failures into actionable and interpretable signals for adaptation.

Paper Structure

This paper contains 30 sections, 2 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: A closed-loop, language-driven framework for interpretable failure mode discovery in autonomous systems. It extracts semantically meaningful failure patterns from deployment-time perceptual data without supervision and organizes them into human-understandable clusters. These clusters support downstream applications such as targeted data collection, policy refinement, and runtime failure monitoring, enabling scalable and continuous safety improvement.
  • Figure 2: A failure inference example in which the robot dropped a pot of water on the floor while carrying it. The red-bordered frame is the frame at the failure timestamp. The blue box shows the prompt and the orange box shows the LLM's response.
  • Figure 3: Robot manipulation failure clusters with examples.
  • Figure 4: Heatmaps comparing similarity scores between the RoboFail expert-defined failure taxonomy and the clusters generated by (a) our method, (b) BERTopic, and (c) BERTopic-LLM.
  • Figure 5: Weighted F-1 Score Comparison
  • ...and 3 more figures