LLMs for Generation of Architectural Components: An Exploratory Empirical Study in the Serverless World
Shrikara Arun, Meghana Tedla, Karthik Vaidhyanathan
TL;DR
The paper investigates whether large language models (LLMs) can automatically generate software architectural components in a Function-as-a-Service (serverless) context. In an exploratory empirical study that uses function masking and prompts of increasing contextual detail, the authors evaluate the functional correctness and code quality of generated functions against real repositories. They find that larger LLMs and richer context (Type 3 prompts) yield higher test-pass rates and code quality more comparable to the original human-written code, though fully autonomous generation with no human involvement remains challenging. The work emphasizes the value of human-in-the-loop GenAI for architectural component generation and identifies data quality and benchmarking gaps as critical future bottlenecks. Overall, the study provides a first empirical characterization of GenAI-assisted architectural component generation and outlines concrete directions for improving next-generation software architecture (SA) tooling.
Abstract
Recently, the exponential growth in the capability and pervasiveness of Large Language Models (LLMs) has led to significant work in the field of code generation. However, this generation has been limited to code snippets. Going one step further, our desideratum is to automatically generate architectural components. This would not only shorten development time, but would also eventually enable us to skip the development phase entirely, moving directly from design decisions to deployment. To this end, we conduct an exploratory study on the capability of LLMs to generate architectural components for Functions as a Service (FaaS), commonly known as serverless functions. The small size of their architectural components makes this architectural style more amenable to generation with current LLMs than other styles such as monoliths and microservices. We perform the study by systematically selecting open-source serverless repositories, masking a serverless function, and using state-of-the-art LLMs, provided with varying levels of context information about the overall system, to generate the masked function. We evaluate correctness through the existing tests in the repositories, and we use metrics from the Software Engineering (SE) and Natural Language Processing (NLP) domains to evaluate code quality and the degree of similarity between human- and LLM-generated code, respectively. Along with our findings, we also present a discussion on the path forward for using GenAI in architectural component generation.
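To make the masking step concrete, the following is a minimal sketch of how one function in a Python source file might be masked and a prompt assembled around it. It is an illustration under stated assumptions, not the authors' tooling: the names `mask_function`, `build_prompt`, and `MASK_STATEMENT` are hypothetical, the target is assumed to be a Python file, and the `context_snippets` argument is only a stand-in for the "varying levels of context" and does not reproduce the paper's prompt types.

```python
import ast

# Hypothetical placeholder that replaces the masked function's body.
MASK_STATEMENT = "raise NotImplementedError('MASKED')"


def mask_function(source: str, func_name: str) -> tuple[str, str]:
    """Return (masked_source, original_function_source) for `func_name`.

    The signature is kept so a model can see what it must implement;
    only the body is replaced with a placeholder statement.
    Note: ast.unparse drops comments and original formatting.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name == func_name:
            original = ast.get_source_segment(source, node)
            node.body = ast.parse(MASK_STATEMENT).body
            ast.fix_missing_locations(tree)
            return ast.unparse(tree), original
    raise ValueError(f"function {func_name!r} not found")


def build_prompt(masked_source: str, func_name: str,
                 context_snippets: list[str]) -> str:
    """Assemble a prompt from the masked file plus optional extra context.

    Passing more snippets models a richer-context prompt (an assumption
    made for illustration only).
    """
    parts = [
        f"Implement the body of `{func_name}` in the file below.",
        masked_source,
    ]
    parts.extend(context_snippets)
    return "\n\n".join(parts)
```

In such a setup, the generated body would be substituted back into the repository and the repository's existing tests run against it, while the returned original source would serve as the reference for similarity and code quality comparisons.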
