Classification or Prompting: A Case Study on Legal Requirements Traceability

Romina Etezadi; Sallam Abualhaija; Chetan Arora; Lionel Briand

TL;DR

The paper tackles legal requirements traceability by evaluating two approaches: Kashif, a classifier built on sentence transformers and semantic similarity, and RICE_LRT, a prompt-driven LLM approach built on the RICE prompt engineering framework. Kashif is evaluated on a HIPAA benchmark, where it achieves an F2 score of about 63%. RICE_LRT is evaluated on a more complex GDPR requirements document, where it attains 84% average recall and a 61% F2 score and outperforms both Kashif and baseline prompts. The results indicate that models tailored to legal artifacts and engineered prompts can reduce manual tracing effort, yet transfer across legal contexts remains challenging, motivating future work on human-in-the-loop and domain-knowledge augmentation. Overall, the study shows that semantic similarity-based classification and carefully crafted LLM prompts can each advance automated legal requirements traceability, with practical implications for compliance workflows.

Abstract

New regulations are introduced to ensure software development aligns with ethical concerns and protects public safety. Showing compliance requires tracing requirements to legal provisions. Requirements traceability is a key task in which engineers must analyze technical requirements against target artifacts, often within limited time. Manually analyzing complex systems with hundreds of requirements is infeasible, and the legal dimension adds challenges that further increase the effort. In this paper, we investigate two automated solutions based on language models, including large language models (LLMs). The first solution, Kashif, is a classifier that leverages sentence transformers and semantic similarity. The second solution, RICE_LRT, prompts a recent LLM based on RICE, a prompt engineering framework. Using a publicly available benchmark dataset, we empirically evaluate Kashif and compare it against seven baseline classifiers from the literature (LSI, LDA, GloVe, TraceBERT, RoBERTa, and LLaMa). Kashif can identify trace links with an F2 score of 63%, outperforming the best baseline by a substantial margin of 21 percentage points (pp) in F2 score. On a newly created and more complex requirements document traced to the European General Data Protection Regulation (GDPR), RICE_LRT outperforms Kashif and baseline prompts in the literature, achieving an average recall of 84% and an F2 score of 61%, and improving the F2 score by 34 pp compared to the best baseline prompt. Our results indicate that requirements traceability in legal contexts cannot be adequately addressed by techniques proposed in the literature that are not specifically designed for legal artifacts. Furthermore, we demonstrate that our engineered prompt outperforms both classifier-based approaches and baseline prompts.
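For readers unfamiliar with the F2 score reported above, the following is a minimal sketch of how an F-beta score can be computed from raw trace-link counts. With beta = 2, recall is weighted more heavily than precision, which suits traceability settings where a missed link is costlier than a false alarm. The function name `f_beta` and the counts used are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Compute the F-beta score from true positives, false positives,
    and false negatives. beta > 1 favors recall over precision."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # no correct links predicted; define the score as 0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts (illustrative only, not results from the paper):
# 8 correct trace links found, 4 spurious links, 2 true links missed.
f2 = f_beta(tp=8, fp=4, fn=2)            # recall-weighted score
f1 = f_beta(tp=8, fp=4, fn=2, beta=1.0)  # balanced score, for comparison
print(f"F2 = {f2:.3f}, F1 = {f1:.3f}")
```

Because recall (0.80) exceeds precision (about 0.67) in this toy case, F2 comes out higher than F1. A gain reported in percentage points, such as the 21 pp above, is the plain difference between two such scores expressed as percentages.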
