Synergistic Enhancement of Requirement-to-Code Traceability: A Framework Combining Large Language Model based Data Augmentation and an Advanced Encoder

Jianzhang Zhang, Jialong Zhou, Nan Niu, Jinping Hua, Chuang Liu

TL;DR

This study tackles the data scarcity challenge in requirement-to-code traceability link recovery (RTLR) by proposing a synergistic framework that combines LLM-driven data augmentation with an advanced, language-aligned encoder. The method systematically evaluates prompting strategies, demonstrates robustness across multiple leading LLMs, and shows that aligning the encoder with the target languages yields substantial gains. On four public datasets, data augmentation alone improved $F_1$ by up to $26.66\%$, and a more capable encoder added up to $11.25\%$ more; the full framework gained up to $28.59\%$ in $F_1$ and $28.9\%$ in $F_2$ over the baseline and outperformed ten established methods. The results offer a pragmatic, scalable path for deploying data-driven RTLR in industry, particularly where labeled data is scarce.

Abstract

Automated requirement-to-code traceability link recovery, essential for industrial system quality and safety, is critically hindered by the scarcity of labeled data. To address this bottleneck, this paper proposes and validates a synergistic framework that integrates large language model (LLM)-driven data augmentation with an advanced encoder. We first demonstrate that data augmentation, optimized through a systematic evaluation of bi-directional and zero/few-shot prompting strategies, is highly effective, while the choice among leading LLMs is not a significant performance factor. Building on the augmented data, we further enhance an established, state-of-the-art pre-trained language model based method by incorporating an encoder distinguished by a broader pre-training corpus and an extended context window. Our experiments on four public datasets quantify the distinct contributions of our framework's components: on its own, data augmentation consistently improves the baseline method, providing substantial performance gains of up to 26.66%; incorporating the advanced encoder provides an additional lift of 2.21% to 11.25%. This synergy culminates in a fully optimized framework with maximum gains of up to 28.59% on $F_1$ score and 28.9% on $F_2$ score over the established baseline, decisively outperforming ten established baselines from three dominant paradigms. This work contributes a pragmatic and scalable methodology to overcome the data scarcity bottleneck, paving the way for broader industrial adoption of data-driven requirement-to-code traceability.
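The abstract reports gains on both $F_1$ and $F_2$. As a reminder of how the two metrics differ, the following is a minimal sketch of the standard $F_\beta$ definition, $F_\beta = (1+\beta^2)\,P\,R / (\beta^2 P + R)$, where $P$ is precision and $R$ is recall. The function name `f_beta` and the precision/recall values are illustrative assumptions, not figures or code from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta = 1 weights precision and recall equally (F1);
    beta = 2 weights recall more heavily than precision (F2).
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical values for illustration only (not results from the paper).
precision, recall = 0.5, 0.8
f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)
print(f"F1 = {f1:.4f}, F2 = {f2:.4f}")
```

Because $F_2$ rewards recall more than precision, a model that recovers more true links at the cost of some false positives scores higher on $F_2$ than on $F_1$, which is presumably why traceability studies often report both.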

Paper Structure

This paper contains 34 sections, 2 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: A Prototypical Scenario of Requirement-to-Code Traceability Link Recovery. The figure illustrates the recovery of links for a modified requirement ($R_{2}^{\prime}$), which is representative of broader RTLR challenges, such as handling new requirements and code artifacts.
  • Figure 2: An Overview of the Three-Stage Synergistic Framework for Requirement-to-Code Traceability, Integrating LLM-driven Data Augmentation with an Enhanced Model Architecture.
  • Figure 3: The Structured Zero-shot Prompt Template for Generating Code from a Requirement.
  • Figure 4: The Structured Zero-shot Prompt Template for Generating a Requirement from Code.
  • Figure 5: The Structured Few-shot Prompt Template for Generating Code from a Requirement.
  • ...and 5 more figures