Who's Who? LLM-assisted Software Traceability with Architecture Entity Recognition

Dominik Fuchß, Haoyu Liu, Sophie Corallo, Tobias Hey, Jan Keim, Johannes von Geisau, Anne Koziolek

TL;DR

This work tackles the semantic gap in software architecture traceability with two LLM-based approaches: ExArch, which generates a simple SAM from SAD and code so that TransArC can run without a manually created SAM, and ArTEMiS, which performs named entity recognition (NER) for architecture entities in SAD and links them to SAM entities. In the evaluation, ExArch achieves an F1 comparable to TransArC with a manual SAM (0.86 vs. 0.87) and outperforms baselines that do not use a manual SAM, while ArTEMiS is on par with the heuristic-based SWATTR and can replace it when integrated with TransArC for SAD-to-code TLR. The combined configuration (TransArC + ExArch + ArTEMiS) yields the best results among the methods that need no manual SAM. Overall, the paper shows that LLMs can automate SAM generation and architecture-aware linking, making TLR more practical and accessible.

Abstract

Identifying architecturally relevant entities in textual artifacts is crucial for Traceability Link Recovery (TLR) between Software Architecture Documentation (SAD) and source code. While Software Architecture Models (SAMs) can bridge the semantic gap between these artifacts, their manual creation is time-consuming. Large Language Models (LLMs) offer new capabilities for extracting architectural entities from SAD and source code to construct SAMs automatically or establish direct trace links. This paper presents two LLM-based approaches: ExArch extracts component names as simple SAMs from SAD and source code to eliminate the need for manual SAM creation, while ArTEMiS identifies architectural entities in documentation and matches them with (manually or automatically generated) SAM entities. Our evaluation compares against state-of-the-art approaches SWATTR, TransArC and ArDoCode. TransArC achieves strong performance (F1: 0.87) but requires manually created SAMs; ExArch achieves comparable results (F1: 0.86) using only SAD and code. ArTEMiS is on par with the traditional heuristic-based SWATTR (F1: 0.81) and can successfully replace it when integrated with TransArC. The combination of ArTEMiS and ExArch outperforms ArDoCode, the best baseline without manual SAMs. Our results demonstrate that LLMs can effectively identify architectural entities in textual artifacts, enabling automated SAM generation and TLR, making architecture-code traceability more practical and accessible.
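The F1 scores quoted above summarize how well recovered trace links agree with a gold standard. As a minimal sketch of that metric, assuming trace links are represented as (sentence id, code element) pairs (a hypothetical representation, as are the file names below) and leaving aside how the paper aggregates scores across projects, which is not stated here:

```python
from typing import Set, Tuple

# Hypothetical representation: one trace link = (SAD sentence id, code element).
TraceLink = Tuple[str, str]


def precision_recall_f1(
    predicted: Set[TraceLink], gold: Set[TraceLink]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 of predicted links against gold links."""
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only, not taken from the paper's benchmark.
gold = {("s1", "Facade.java"), ("s2", "Cache.java"),
        ("s3", "Db.java"), ("s4", "Db.java")}
predicted = {("s1", "Facade.java"), ("s2", "Cache.java"),
             ("s3", "Db.java"), ("s5", "Ui.java")}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it falls sharply when either precision or recall is low, so a high score requires an approach to do well on both.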

Paper Structure

This paper contains 39 sections, 1 equation, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Overview of the ExArch Approach for TLR. Artifacts in Orange, Prompting in Blue, Extraction of Features in White, and TransArC in Purple.
  • Figure 2: Comparison of extracted SAMs for MediaStore using SADs
  • Figure 3: Comparison of extracted SAMs. For JabRef, the code-extracted components shown in the picture cover only the main components, as determined by Llama 3.1 70b.