Transfer learning for conflict and duplicate detection in software requirement pairs

Garima Malik; Savas Yildirim; Mucahit Cevik; Ayse Bener; Devang Parikh

TL;DR

This paper tackles automatic detection of conflicting and duplicate software requirements. It introduces SR-BERT, a Sentence-BERT-based bi-encoder, combined with sequential transfer learning (e.g., MNLI pretraining followed by CDN fine-tuning) and cross-domain transfer strategies, plus rule-based filtering to refine predictions. The method encodes requirement pairs using the representation $R_1 \oplus R_2 \oplus (R_1 - R_2)$, and is evaluated on a proprietary CDN dataset and four open-source SRS datasets, showing strong performance on larger datasets and promising cross-domain generalization when augmented with information extraction rules. The work offers practical guidance for practitioners and researchers, demonstrating how sequential learning, domain adaptation, and rule-based post-processing can automate conflict and duplicate detection in RE, and it is supported by dataset resources and statistical validation.
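The pair representation $R_1 \oplus R_2 \oplus (R_1 - R_2)$ can be illustrated with a minimal sketch. It assumes each requirement has already been encoded into a fixed-length vector; the toy 3-dimensional vectors, the function names `pair_features` and `linear_scores`, the three-class output, and the weights are all hypothetical stand-ins for illustration, not the authors' implementation or trained parameters.

```python
from typing import List, Sequence


def pair_features(r1: Sequence[float], r2: Sequence[float]) -> List[float]:
    """Build R1 (+) R2 (+) (R1 - R2): both embeddings plus their element-wise difference."""
    if len(r1) != len(r2):
        raise ValueError("embeddings must have the same dimension")
    diff = [a - b for a, b in zip(r1, r2)]
    return list(r1) + list(r2) + diff


def linear_scores(features: Sequence[float],
                  weights: Sequence[Sequence[float]],
                  bias: Sequence[float]) -> List[float]:
    """Apply one linear layer: one weight row and one bias term per class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


# Toy 3-dimensional vectors standing in for sentence embeddings (assumption).
r1 = [0.5, 1.0, -2.0]
r2 = [0.25, 1.0, 1.0]
features = pair_features(r1, r2)  # length is 3 * embedding dimension

# Hypothetical, untrained weights for three classes (e.g. conflict, duplicate, neutral).
weights = [[0.1] * 9, [0.0] * 9, [-0.1] * 9]
bias = [0.0, 0.0, 0.0]
scores = linear_scores(features, weights, bias)
predicted_class = max(range(len(scores)), key=lambda i: scores[i])
```

In a real pipeline the vectors would come from a sentence encoder and the linear layer would be trained jointly with it; the difference term gives the classifier a direct signal about where the two requirements diverge.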

Abstract

Consistent and holistic expression of software requirements is important for the success of software projects. In this study, we aim to enhance the efficiency of software development processes by automatically identifying conflicting and duplicate software requirement specifications. We formulate the conflict and duplicate detection problem as a requirement pair classification task. We design a novel transformer-based architecture, SR-BERT, which incorporates Sentence-BERT and bi-encoders for the conflict and duplicate identification task. Furthermore, we apply supervised multi-stage fine-tuning to the pre-trained transformer models. We test the performance of different transfer models using four different datasets. We find that sequentially trained and fine-tuned transformer models perform well across the datasets, with SR-BERT achieving the best performance for larger datasets. We also explore the cross-domain performance of conflict detection models and adopt a rule-based filtering approach to validate the model classifications. Our analysis indicates that the sentence pair classification approach and the proposed transformer-based natural language processing strategies can contribute significantly to achieving automation in conflict and duplicate detection.

Paper Structure (33 sections, 3 equations, 12 figures, 19 tables)

Introduction
Structure of the paper
Literature Review
NLP techniques in RE
Transfer learning in RE
Sentence pair classification
NLP-based conflict identification methods
NLP-based duplicate text detection
Research gaps and contributions
Methodology
Datasets
Transfer learning techniques for requirement pair classification
Sequential transfer learning
SR-BERT
Baseline classification methods
...and 18 more sections

Figures (12)

Figure 1: Cosine similarity distribution of requirement pair datasets for the respective class labels, computed using SBERT embeddings. This visualization highlights the relationships between distinct requirement pairs based on their semantic similarity as represented by the embeddings.
Figure 1: Confusion matrix for CN Classification using deberta-base-mnli transformer checkpoint. Support values for the datasets are as follows: CN (C:1,851.00, N:1,133.33), UAV (C:6.00, N:2,217.33), World Vista (C:11.66, N:3,614.33), PURE (C:6.66, N:730.33), OPENCOSS (C:3.33, N:2,258.66)
Figure 2: A visual description of the sequential transfer learning approach for requirement pair classification
Figure 2: Confusion matrix for CN Classification using bert-base-uncased-MNLI transformer checkpoint. Support values for the datasets are as follows: CN (C:1,851.00, N:1,133.33), UAV (C:6.00, N:2,217.33), World Vista (C:11.66, N:3,614.33), PURE (C:6.66, N:730.33), OPENCOSS (C:3.33, N:2,258.66)
Figure 3: SBERT-based model for software requirement pair classification. Individual requirements are encoded using the SBERT model to obtain their respective embeddings. These embeddings are then concatenated and fed into a linear layer, which outputs the classification result for each requirement pair.
...and 7 more figures
