LLMs as Judges: Toward The Automatic Review of GSN-compliant Assurance Cases
Gerhard Yu, Mithila Sivakumar, Alvine B. Belle, Soude Ghari, Song Wang, Timothy C. Lethbridge
TL;DR
This work introduces an LLM-as-a-judge framework to automate the review of assurance cases written in Goal Structuring Notation (GSN), addressing the manual review bottleneck in safety-critical domains. It formalizes four established assurance-case review criteria as predicate-based rules and translates them into tailored prompts, including chain-of-thought (CoT) variants, to query multiple state-of-the-art LLMs. Experiments with four LLMs on five real-world assurance cases show that DeepSeek-R1 and GPT-4.1 deliver the strongest automated reviews, particularly under one-shot prompting with CoT, yet human reviewers remain essential to validate, refine, and contextualize LLM outputs. The study contributes a concrete four-phase methodology (collection, textual conversion, criteria formalization, and prompt engineering) and a taxonomy of LLM review capabilities and issues, highlighting both the promise and the current limitations of fully automated assurance-case review.
Abstract
Assurance cases make it possible to verify that certain non-functional requirements of mission-critical systems, including safety, security, and reliability, are correctly implemented. They can be used in the specification of autonomous driving, avionics, air traffic control, and similar systems. They aim to reduce the risk of harm of all kinds, including human mortality, environmental damage, and financial loss. However, assurance cases are often extensive documents spanning hundreds of pages, which makes their creation, review, and maintenance error-prone, time-consuming, and tedious. Therefore, there is a growing need to leverage (semi-)automated techniques, such as those powered by generative AI and large language models (LLMs), to enhance efficiency, consistency, and accuracy across the entire assurance-case lifecycle. In this paper, we focus on assurance case review, a critical task that ensures the quality of assurance cases and therefore fosters their acceptance by regulatory authorities. We propose a novel approach that leverages the LLM-as-a-judge paradigm to automate the review process. Specifically, we propose new predicate-based rules that formalize well-established assurance case review criteria, allowing us to craft LLM prompts tailored to the review task. Our experiments on several state-of-the-art LLMs (GPT-4o, GPT-4.1, DeepSeek-R1, and Gemini 2.0 Flash) show that, while most LLMs exhibit relatively good review capabilities, DeepSeek-R1 and GPT-4.1 perform best, with DeepSeek-R1 ultimately outperforming GPT-4.1. However, our experimental results also suggest that human reviewers are still needed to refine the reviews that LLMs produce.
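To make the idea of turning a predicate-based rule into a review prompt more concrete, the following minimal Python sketch may help. It is an illustration only, not the paper's implementation: the `Node` class, the "every goal is supported" rule, and the prompt wording are all hypothetical, and the four criteria actually formalized in the paper are not reproduced here. No LLM is called; the sketch only builds the prompt text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical, simplified GSN element (not the paper's data model)."""
    node_id: str
    kind: str  # e.g. "Goal", "Strategy", "Solution"
    text: str
    supported_by: List[str] = field(default_factory=list)


def unsupported_goals(nodes: List[Node]) -> List[str]:
    """Hypothetical predicate: for every Goal g, supported_by(g) is non-empty.

    Returns the ids of the goals that violate the predicate.
    """
    return [n.node_id for n in nodes if n.kind == "Goal" and not n.supported_by]


def to_text(nodes: List[Node]) -> str:
    """Flatten the case into plain text, one element per line."""
    lines = []
    for n in nodes:
        line = f"{n.kind} {n.node_id}: {n.text}"
        if n.supported_by:
            line += f" (supported by: {', '.join(n.supported_by)})"
        lines.append(line)
    return "\n".join(lines)


def build_review_prompt(nodes: List[Node], criterion: str, rule: str) -> str:
    """Assemble an assumed CoT-style review prompt for one criterion."""
    return (
        "You are reviewing a GSN-compliant assurance case.\n"
        f"Criterion: {criterion}\n"
        f"Rule: {rule}\n"
        "Assurance case:\n"
        f"{to_text(nodes)}\n"
        "Think step by step, then state whether the criterion is satisfied."
    )


# Toy assurance case, invented for illustration.
case = [
    Node("G1", "Goal", "The system is acceptably safe", ["S1"]),
    Node("S1", "Strategy", "Argue over identified hazards", ["G2"]),
    Node("G2", "Goal", "Hazard H1 is mitigated"),
]

prompt = build_review_prompt(
    case,
    criterion="Every goal is supported (illustrative criterion)",
    rule="For all g in Goals: supported_by(g) is not empty",
)
print(unsupported_goals(case))
print(prompt)
```

A deterministic check such as `unsupported_goals` could also serve as a reference against which an LLM's verdict is compared, although whether the paper does this is not stated in the abstract.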
