Software Architecture Meets LLMs: A Systematic Literature Review
Larissa Schmid, Tobias Hey, Martin Armbruster, Sophie Corallo, Dominik Fuchß, Jan Keim, Haoyu Liu, Anne Koziolek
TL;DR
This systematic literature review analyzes 18 papers on the use of large language models (LLMs) in software architecture to map current tasks, models, optimization strategies, evaluation methods, and future directions. It finds that LLMs are applied across four main architecture-related tasks, predominantly using decoder-only models and end-to-end automation, with promising performance against baselines in many cases. Evaluation often relies on traditional metrics like precision/recall/F1 and text-generation metrics, but about a third of studies lack baseline comparisons, signaling a need for more rigorous benchmarking. The review highlights gaps such as generating source code from architectural designs, cloud-native architecture considerations, and conformance checking, and advocates for more advanced prompting techniques and continual model evaluation as LLMs evolve.
Abstract
Large Language Models (LLMs) are used for many different software engineering tasks. In software architecture, they have been applied to tasks such as classification of design decisions, detection of design patterns, and generation of software architecture design from requirements. However, there is little overview of how well they work, what challenges exist, and what open problems remain. In this paper, we present a systematic literature review on the use of LLMs in software architecture. We analyze 18 research articles to answer five research questions, such as which software architecture tasks LLMs are used for, how much automation they provide, which models and techniques are used, and how these approaches are evaluated. Our findings show that while LLMs are increasingly applied to a variety of software architecture tasks and often outperform baselines, some areas, such as generating source code from architectural design, cloud-native computing and architecture, and checking conformance, remain underexplored. Although current approaches mostly use simple prompting techniques, we identify a growing research interest in refining LLM-based approaches by integrating advanced techniques.
