Large language models for generating rules, yay or nay?

Shangeetha Sivasothy; Scott Barnett; Rena Logothetis; Mohamed Abdelrazek; Zafaryab Rasool; Srikanth Thudumu; Zac Brannelly

TL;DR

The paper investigates using large language models (GPT-3.5 and GPT-4) as a source of world knowledge to bootstrap logic rules for safety-critical software in healthcare, with the aim of reducing the requirements elicitation burden on subject-matter experts (SMEs). It introduces RuleFlex, a four-component pipeline (linguistic interface, rule generation engine, dynamic rule modifier, API generator) that generates, compares, and deploys rule sets, with SMEs and developers collaborating to refine them. Four prompting techniques are evaluated against the target rules of the pandemic intervention monitoring system (PiMS); few-shot prompting aligns most closely with them, though the LLMs produce fewer rules than the experts and cannot reliably generate per-rule thresholds. The study shows that LLMs can bootstrap an initial rule set and offer a domain-wide perspective, which speeds up early rule elicitation. It also highlights limitations in domain-variable coverage and threshold specification, which inform directions for broader deployment and future enhancement in healthcare and other domains.

Abstract

Engineering safety-critical systems such as medical devices and digital health intervention systems is complex and requires long-term engagement with subject-matter experts (SMEs) to capture the systems' expected behaviour. In this paper, we present a novel approach that leverages Large Language Models (LLMs), such as GPT-3.5 and GPT-4, as a potential world model to accelerate the engineering of software systems. This approach uses LLMs to generate logic rules, which SMEs can then review and refine before deployment. We evaluate our approach using a medical rule set created from the pandemic intervention monitoring system in collaboration with medical professionals during COVID-19. Our experiments show that 1) LLMs have a world model that bootstraps implementation, 2) LLMs generated fewer rules than experts, and 3) LLMs do not have the capacity to generate thresholds for each rule. Our work shows how LLMs augment the requirements elicitation process by providing access to a world model for domains.
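
To make the comparison between an LLM-generated rule set and an expert rule set concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. The `Rule` class, the `compare_rule_sets` helper, the variable names, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the PiMS rule set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule shape: one monitored variable and an optional threshold."""
    variable: str                # domain variable the rule monitors
    threshold: Optional[float]   # None when no numeric threshold was supplied


def compare_rule_sets(expert, generated):
    """Summarise how a generated rule set lines up with an expert one."""
    expert_vars = {r.variable for r in expert}
    generated_vars = {r.variable for r in generated}
    return {
        "expert_count": len(expert),
        "generated_count": len(generated),
        "shared_variables": sorted(expert_vars & generated_vars),
        "missing_variables": sorted(expert_vars - generated_vars),
        "rules_without_threshold": sorted(
            r.variable for r in generated if r.threshold is None
        ),
    }


# Made-up example data (assumption): not real clinical thresholds.
expert_rules = [
    Rule("temperature", 38.0),
    Rule("oxygen_saturation", 92.0),
    Rule("heart_rate", 120.0),
]
generated_rules = [
    Rule("temperature", None),         # rule produced without a threshold
    Rule("oxygen_saturation", 92.0),
]

report = compare_rule_sets(expert_rules, generated_rules)
for key, value in report.items():
    print(f"{key}: {value}")
```

A report of this shape gives an SME a short review list: variables the generated set missed, and generated rules that still need a threshold before deployment.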
