Requirements Satisfiability with In-Context Learning

Sarah Santos, Travis Breaux, Thomas Norton, Sara Haghighi, Sepideh Ghanavati

TL;DR

This paper explores using in-context learning to generate and evaluate satisfaction arguments that connect a system specification and domain knowledge to a requirement, formalized as $S, K \vdash R$. It builds a three-stage workflow grounded in GDPR consent rules: knowledge extraction from regulatory guidance, specification generation from app descriptions, and satisfiability evaluation using targeted prompts and chain-of-thought reasoning. Empirical results show GPT-4 achieving up to ~95.6% overall accuracy in satisfiability checks, with chain-of-thought prompting improving overall GPT-3.5 accuracy by 9.0% and a generic prompt template often outperforming domain-specific templates. The work demonstrates the practical viability of generative reasoning for requirements engineering, offers insights into prompt design trade-offs, and provides a replication package to enable further investigation and extension in legal-NLP contexts.

Abstract

Language models that can learn a task at inference time, called in-context learning (ICL), show increasing promise in natural language inference tasks. In ICL, a model user constructs a prompt to describe a task with a natural language instruction and zero or more examples, called demonstrations. The prompt is then input to the language model to generate a completion. In this paper, we apply ICL to the design and evaluation of satisfaction arguments, which describe how a requirement is satisfied by a system specification and associated domain knowledge. The approach builds on three prompt design patterns (augmented generation, prompt tuning, and chain-of-thought prompting) and is evaluated on a privacy problem to check whether a mobile app scenario and associated design description satisfy eight consent requirements from the EU General Data Protection Regulation (GDPR). The overall results show that GPT-4 can be used to verify requirements satisfaction with 96.7% accuracy and dissatisfaction with 93.2% accuracy. Inverting the requirement improves verification of dissatisfaction to 97.2%. Chain-of-thought prompting improves overall GPT-3.5 performance by 9.0% accuracy. We discuss the trade-offs among templates, models, and prompt strategies and provide a detailed analysis of the generated specifications to inform how the approach can be applied in practice.
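
To make the prompt structure described in the abstract concrete, the following is a minimal sketch of how an instruction, zero or more demonstrations, and a query consisting of a specification $S$, domain knowledge $K$, and requirement $R$ might be assembled into a single prompt. The names `Demonstration`, `format_case`, and `build_prompt`, the field labels, and the example texts are all hypothetical and are not taken from the paper's templates or replication package; the sketch only builds the prompt string and does not call any model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demonstration:
    """One worked example shown to the model (hypothetical structure)."""
    specification: str
    knowledge: str
    requirement: str
    answer: str  # e.g. "yes" or "no"


def format_case(specification: str, knowledge: str, requirement: str) -> str:
    """Render one (S, K, R) case; the labels are illustrative, not the paper's."""
    return (
        f"Specification: {specification}\n"
        f"Domain knowledge: {knowledge}\n"
        f"Requirement: {requirement}\n"
        "Answer:"
    )


def build_prompt(
    instruction: str,
    demonstrations: List[Demonstration],
    specification: str,
    knowledge: str,
    requirement: str,
) -> str:
    """Assemble instruction + n demonstrations + the unanswered query case."""
    parts = [instruction]
    for demo in demonstrations:
        case = format_case(demo.specification, demo.knowledge, demo.requirement)
        parts.append(f"{case} {demo.answer}")
    # The query is left without an answer so the model completes it.
    parts.append(format_case(specification, knowledge, requirement))
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Illustrative texts only; they are not drawn from the paper's data.
    demo = Demonstration(
        specification="The app shows a consent dialog before collecting location.",
        knowledge="Consent must be obtained before processing begins.",
        requirement="The user is asked for consent prior to data collection.",
        answer="yes",
    )
    print(
        build_prompt(
            "Decide whether the specification and domain knowledge satisfy the requirement.",
            [demo],
            "The app uploads contacts at install time without asking.",
            "Consent must be obtained before processing begins.",
            "The user is asked for consent prior to data collection.",
        )
    )
```

Passing an empty list of demonstrations yields a zero-shot prompt, while passing $n$ demonstrations yields an $n$-shot prompt, which matches the abstract's "zero or more examples" description of ICL.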

Paper Structure

This paper contains 19 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: $n$-Shot CoT Accuracy with Generic Template