Investigating Software Aging in LLM-Generated Software Systems
César Santos, Ermeson Andrade, Roberto Natella
TL;DR
This study investigates the long-term reliability of software generated by large language models by conducting 50-hour load tests on four Bolt-generated service applications derived from Baxbench prompts. Using memory, throughput, and response-time metrics, together with Mann-Kendall tests and Sen’s slope estimates, the authors provide empirical evidence of software aging in automatically generated software, with aging signatures that vary by workload type. The work contributes a reproducible methodology for assessing aging in LLM-generated code and shows that aging phenomena akin to those in traditional software also appear in AI-generated code, which calls for monitoring and mitigation strategies. The findings matter for industrial deployment of automatically generated services: AI-assisted development pipelines need to account for long-term reliability.
Abstract
Automatically generated software, especially code produced by Large Language Models (LLMs), is increasingly adopted to accelerate development and reduce manual effort. However, little is known about the long-term reliability of such systems under sustained execution. In this paper, we experimentally investigate the phenomenon of software aging in applications generated by LLM-based tools. Using the Bolt platform and standardized prompts from Baxbench, we generated four service-oriented applications and subjected them to 50-hour load tests. Resource usage, response time, and throughput were continuously monitored to detect degradation patterns. The results reveal significant evidence of software aging, including progressive memory growth, increased response time, and performance instability across all applications. Statistical analyses confirm these trends and highlight variability in the severity of aging according to the type of application. Our findings show the need to consider aging in automatically generated software and provide a foundation for future studies on mitigation strategies and long-term reliability evaluation.
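The trend analysis named in the TL;DR (a Mann-Kendall test paired with Sen's slope) can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names `mann_kendall` and `sens_slope` are hypothetical, the sample series is synthetic, and the test uses the usual normal approximation with tie correction, which is only reasonable for series that are not very short.

```python
import math
from collections import Counter
from itertools import combinations
from statistics import median


def mann_kendall(values):
    """Return (S, Z, two-sided p-value) for the Mann-Kendall trend test."""
    n = len(values)
    # S is the number of increasing pairs minus the number of decreasing pairs.
    s = sum((later > earlier) - (later < earlier)
            for earlier, later in combinations(values, 2))
    # Variance of S under the null hypothesis, corrected for tied groups.
    tie_sizes = Counter(values).values()
    var_s = (n * (n - 1) * (2 * n + 5)
             - sum(t * (t - 1) * (2 * t + 5) for t in tie_sizes)) / 18
    if var_s == 0:
        return s, 0.0, 1.0
    # Continuity-corrected standard normal statistic.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p_value


def sens_slope(times, values):
    """Return Sen's slope: the median of all pairwise slopes."""
    slopes = [(values[j] - values[i]) / (times[j] - times[i])
              for i, j in combinations(range(len(values)), 2)
              if times[j] != times[i]]
    return median(slopes)


# Synthetic example only (NOT data from the study): memory in MB sampled hourly.
hours = [0, 1, 2, 3, 4, 5]
memory_mb = [100, 102, 104, 106, 108, 110]
s, z, p = mann_kendall(memory_mb)
slope = sens_slope(hours, memory_mb)
print(f"S={s}, Z={z:.2f}, p={p:.4f}, slope={slope} MB/hour")
```

A small p-value indicates a monotonic trend in the monitored metric, while Sen's slope gives a robust estimate of its rate (here, memory growth per hour). Both are insensitive to occasional outliers, which suits noisy long-running measurements.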
