Automated Requirements Relation Extraction

Quim Motger; Xavier Franch

TL;DR

This work surveys automated relation extraction for textual software requirements, framing relations as directed, typed links $r_i \xrightarrow[]{d} r_j$ to support traceability. It foregrounds two broad NLP streams—syntactic and semantic knowledge representations—and two information-extraction paradigms—retrieval-based and machine-learning-based—illustrated with the PURE dataset from ERTMS/ETCS. A key empirical comparison contrasts ontology-based OpenReq extraction with a fine-tuned BERT approach, highlighting trade-offs in interpretability, data requirements, and scalability. The authors discuss challenges such as annotated-data scarcity and non-uniform relation types, and propose future directions including encoder-based LLMs and data-augmentation through generative models to advance automated relation extraction in NLP4RE.

Abstract

In the context of requirements engineering, relation extraction involves identifying and documenting the associations between different requirements artefacts. When dealing with textual requirements (i.e., requirements expressed using natural language), relation extraction becomes a cognitively challenging task, especially in terms of ambiguity and the effort required from domain experts. Hence, in highly adaptive, large-scale environments, effective and efficient automated relation extraction using natural language processing techniques becomes essential. In this chapter, we present a comprehensive overview of natural language-based relation extraction from text-based requirements. We first describe the fundamentals of requirements relations based on the most relevant literature in the field, including the most common requirements relation types. The core of the chapter is composed of two main sections: (i) natural language techniques for the identification and categorization of requirements relations (i.e., syntactic vs. semantic techniques), and (ii) information extraction methods for the task of relation extraction (i.e., retrieval-based vs. machine learning-based methods). We complement this analysis with the state-of-the-art challenges and the envisioned future research directions. Overall, this chapter aims to provide a clear perspective on the theoretical and practical fundamentals in the field of natural language-based relation extraction.
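To make the notion of a directed, typed relation $r_i \xrightarrow[]{d} r_j$ and the retrieval-based family of methods more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a token-overlap (Jaccard) score as a stand-in for a retrieval-based similarity measure; the names `Relation`, `jaccard`, and `candidate_relations`, the threshold value, the generic `related` label, and the sample requirement texts are all hypothetical and not taken from the chapter or from PURE.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A directed, typed link r_i --d--> r_j between two requirements."""
    source: str    # identifier of r_i
    target: str    # identifier of r_j
    rel_type: str  # relation type d (hypothetical label set)


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two requirement texts, in [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def candidate_relations(requirements, threshold=0.3, rel_type="related"):
    """Propose a relation for every pair whose overlap reaches the threshold.

    The score is symmetric, so the direction (lower id -> higher id) is an
    arbitrary convention of this sketch; assigning a real direction and a
    real type would require one of the techniques surveyed in the chapter.
    """
    ids = sorted(requirements)
    found = []
    for i, ri in enumerate(ids):
        for rj in ids[i + 1:]:
            if jaccard(requirements[ri], requirements[rj]) >= threshold:
                found.append(Relation(ri, rj, rel_type))
    return found


# Hypothetical requirement texts, for illustration only.
sample = {
    "R1": "The system shall display the train speed",
    "R2": "The system shall display the train location",
    "R3": "Operators log in with a password",
}
candidates = candidate_relations(sample)
```

With these sample texts, only the pair R1 and R2 shares enough vocabulary to be proposed as a candidate. A lexical score of this kind is easy to interpret and needs no annotated data, but it cannot distinguish relation types, which is one of the trade-offs the chapter discusses when comparing retrieval-based and machine learning-based methods.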

Paper Structure (28 sections, 9 figures, 3 tables)

Introduction
Fundamentals
Requirements relations
Relation types
NLP-based relation extraction
Sample data set: PURE
NLP Knowledge Representation Techniques
Text Pre-processing Techniques
Syntactic-Based NLP Knowledge Representation Techniques
Semantic-Based NLP Knowledge Representation Techniques
Information Extraction Methods
Retrieval-Based Information Extraction Methods
Linguistic and Text-based Methods
Vectorization Methods
Graph-based Methods
...and 13 more sections

Figures (9)

Figure 1: Summary of NLP techniques and relation extraction methods.
Figure 2: Examples of the generation of n-grams. On the left, a parent node (systems) is the root of two nested n-grams composed by subsequent (control and train) and parallel (national) direct term children. On the right, a non-relevant child (for) from the root term (location) associates the second n-gram (current and location) with the first one (national and values).
Figure 3: Example of cross-document coreference. In the second requirement, the function refers to the core functionality transfer to shunting mentioned in the first requirement, which is the immediate document predecessor.
Figure 4: Example of named-entity recognition (NER). All reported requirements refer to the display of information in the Driver Machine Interface (DMI), an interface component between the driver and the ERTMS/ETCS system.
Figure 5: Dependency tree and named entity recognition results for a pair of annotated requirements $r_i \xrightarrow[]{requires} r_j$ from the PURE data set.
...and 4 more figures
