Diagnosing Robotics Systems Issues with Large Language Models

Jordis Emilia Herrmann, Aswath Mandakath Gopinath, Mikael Norrlof, Mark Niklas Müller

TL;DR

The paper tackles the challenge of diagnosing root causes in complex robotics systems using large language models. It introduces SysDiagBench, a proprietary benchmark of over 2,500 real-world robotics tickets, to evaluate root-cause prediction from ticket data and logs. Through a systematic study of LLM-based diagnostics, including zero-shot prompting, full fine-tuning, LoRA, and QLoRA, it shows that a 7B-parameter model fine-tuned with QLoRA can outperform GPT-4 in diagnostic accuracy at reduced cost, with the LLM-as-a-judge results validated by a human expert study. The work suggests that LLMs can meaningfully aid robotics-system troubleshooting and speed up issue resolution, while human experts remain essential for final judgment and accountability.

Abstract

Quickly resolving issues reported in industrial applications is crucial to minimize economic impact. However, the required data analysis makes diagnosing the underlying root causes a challenging and time-consuming task, even for experts. In contrast, large language models (LLMs) excel at analyzing large amounts of data. Indeed, prior work in AI-Ops demonstrates their effectiveness in analyzing IT systems. Here, we extend this work to the challenging and largely unexplored domain of robotics systems. To this end, we create SYSDIAGBENCH, a proprietary system diagnostics benchmark for robotics, containing over 2500 reported issues. We leverage SYSDIAGBENCH to investigate the performance of LLMs for root cause analysis, considering a range of model sizes and adaptation techniques. Our results show that QLoRA finetuning can be sufficient to let a 7B-parameter model outperform GPT-4 in terms of diagnostic accuracy while being significantly more cost-effective. We validate our LLM-as-a-judge results with a human expert study and find that our best model achieves approval ratings similar to those of our reference labels.

Paper Structure

This paper contains 49 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Visualization of the label extraction process for historic tickets based on querying a strong LLM. Note that during inference time only the grey, but not the blue, boxes are available.
  • Figure 2: Token count distribution of processed finetuning inputs (blue) and corresponding raw logs (red).
  • Figure 3: Mean similarity score of Mistral-Lite-7B for LoRA and QLoRA training depending on rank $r$.
  • Figure 4: Frequency of human experts rating predicted RCs higher (blue), equal (orange), and lower (red) than our reference RC.
  • Figure 5: CoT prompt used for root cause extraction, where <PLACEHOLDERS> for data from the ticket are marked red.
  • ...and 5 more figures