Diagnosing Robotics Systems Issues with Large Language Models

Jordis Emilia Herrmann; Aswath Mandakath Gopinath; Mikael Norrlof; Mark Niklas Müller

TL;DR

The paper tackles root-cause diagnosis in complex robotics systems using large language models. It introduces SysDiagBench, a proprietary benchmark of over 2,500 real-world robotics issue tickets, to evaluate root-cause prediction from ticket data and logs. It systematically compares LLM-based diagnostic approaches, including zero-shot prompting, full fine-tuning, LoRA, and QLoRA, and shows that a 7B-parameter model fine-tuned with QLoRA can outperform GPT-4 in diagnostic accuracy at lower cost, with the LLM-as-a-judge results validated by a human expert study. The work suggests that LLMs can meaningfully aid robotics-system troubleshooting and may speed up issue resolution, while human experts remain essential for final judgment and accountability.

Abstract

Quickly resolving issues reported in industrial applications is crucial to minimize economic impact. However, the required data analysis makes diagnosing the underlying root causes a challenging and time-consuming task, even for experts. In contrast, large language models (LLMs) excel at analyzing large amounts of data. Indeed, prior work in AI-Ops demonstrates their effectiveness in analyzing IT systems. Here, we extend this work to the challenging and largely unexplored domain of robotics systems. To this end, we create SYSDIAGBENCH, a proprietary system diagnostics benchmark for robotics, containing over 2500 reported issues. We leverage SYSDIAGBENCH to investigate the performance of LLMs for root cause analysis, considering a range of model sizes and adaptation techniques. Our results show that QLoRA finetuning can be sufficient to let a 7B-parameter model outperform GPT-4 in terms of diagnostic accuracy while being significantly more cost-effective. We validate our LLM-as-a-judge results with a human expert study and find that our best model achieves similar approval ratings as our reference labels.
