From Perception Logs to Failure Modes: Language-Driven Semantic Clustering of Failures for Robot Safety
Aryaman Gupta, Yusuf Umut Ciftci, Somil Bansal
TL;DR
The paper addresses safety-critical robotic failures, which are diverse and long-tailed, making manual analysis impractical. It proposes a closed-loop, language-driven framework that uses Multimodal Large Language Models to infer failure causes from perception sequences, cluster them into $L$ semantic failure modes via a prompt ensemble, and assign trajectories to clusters for downstream use. Contributions include unsupervised clustering of $N$ failure sequences into human-readable categories with natural-language summaries and keywords; evaluation on RoboFail, Nexar dashcam data, and indoor navigation; and demonstrations of online failure monitoring and targeted data collection for policy refinement. The approach enables scalable, interpretable learning from real-world failures, providing early-warning signals and guiding policy improvement in safety-critical robotic systems.
Abstract
As robotic systems become increasingly integrated into real-world environments, ranging from autonomous vehicles to household assistants, they inevitably encounter diverse and unstructured scenarios that lead to failures. While such failures pose safety and reliability challenges, they also provide rich perceptual data for improving future performance. However, manually analyzing large-scale failure datasets is impractical. In this work, we present a method for automatically organizing large-scale robotic failure data into semantically meaningful failure clusters, enabling scalable learning from failure without human supervision. Our approach leverages the reasoning capabilities of Multimodal Large Language Models (MLLMs), trained on internet-scale data, to infer high-level failure causes from raw perceptual trajectories and discover interpretable structure within uncurated failure logs. The resulting semantic clusters reveal patterns and hypothesized causes of failure. We demonstrate that the discovered failure modes can guide targeted data collection for policy refinement, accelerating iterative improvement in agent policies and overall safety. We also show that the clusters can support online failure monitoring, offering a lightweight safeguard for real-time operation. Together, these results indicate that the framework improves robot learning and robustness by transforming real-world failures into actionable and interpretable signals for adaptation.
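To make the assignment step concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes failure modes have already been discovered and summarized as keyword sets, and it replaces the MLLM with a simple keyword-overlap score. The mode names, keywords, log descriptions, and the functions `assign_mode` and `cluster_sizes` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical failure modes with keywords. In the paper these would be
# produced by an MLLM; here they are hand-written for illustration.
FAILURE_MODES = {
    "occluded obstacle": {"occluded", "hidden", "blind", "corner"},
    "low visibility": {"dark", "glare", "fog", "night"},
    "dynamic agent": {"pedestrian", "cyclist", "crossing", "moving"},
}


def assign_mode(description, modes=FAILURE_MODES):
    """Return the mode whose keywords overlap most with the description.

    Returns None when no keyword matches, i.e. the failure does not fit
    any known mode. Keyword overlap is a stand-in for MLLM reasoning.
    """
    cleaned = description.lower().replace(",", " ").replace(".", " ")
    words = set(cleaned.split())
    scores = {name: len(words & keywords) for name, keywords in modes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def cluster_sizes(descriptions):
    """Count how many failure descriptions fall into each mode."""
    return Counter(assign_mode(d) for d in descriptions)


# Hypothetical natural-language failure explanations, one per trajectory.
logs = [
    "Pedestrian crossing from behind a parked van.",
    "Strong glare at night hid the lane markings.",
    "Robot clipped a chair hidden around a blind corner.",
]

for text in logs:
    print(assign_mode(text), "<-", text)
```

Descriptions that match no keyword return `None`, which in a fuller pipeline could flag a candidate for a new failure mode rather than forcing it into an existing cluster.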
