Quantitative Analysis of Technical Debt and Pattern Violation in Large Language Model Architectures

Tyler Slater

TL;DR

The paper tackles the risk that using AI-generated code to scaffold production systems incurs architectural debt and compromises long-term maintainability. It proposes an empirical framework that combines Hexagonal Architecture constraints, AST-based static analysis, and three metrics, Logical Lines of Code (LLOC), Maintainability Index (MI), and Architectural Violation Rate (AVR), to quantify architectural erosion across three model families. The open-weights model (Llama 3 8B) shows an 80% violation rate and implements less logic, while the proprietary models achieve high architectural conformance (0% violation rate for GPT-5.1), underscoring the need for architecture-guided safeguards. The study also describes a Maintainability Paradox, in which short output inflates MI and so misrepresents quality, and it advocates architecture-as-guardrails and automated linting to mitigate generative debt in AI-assisted software engineering. Future work aims to quantify remediation costs with a Debt Remediation Index and to explore automated refactoring workflows.

Abstract

As Large Language Models (LLMs) transition from code completion tools to autonomous system architects, their impact on long-term software maintainability remains unquantified. While existing research benchmarks functional correctness (pass@k), this study presents the first empirical framework to measure "Architectural Erosion" and the accumulation of Technical Debt in AI-synthesized microservices. We conducted a comparative pilot study of three state-of-the-art models (GPT-5.1, Claude 4.5 Sonnet, and Llama 3 8B) by prompting them to implement a standardized Book Lending Microservice under strict Hexagonal Architecture constraints. Utilizing Abstract Syntax Tree (AST) parsing, we find that while proprietary models achieve high architectural conformance (0% violation rate for GPT-5.1), open-weights models exhibit critical divergence. Specifically, Llama 3 demonstrated an 80% Architectural Violation Rate, frequently bypassing interface adapters to create illegal circular dependencies between Domain and Infrastructure layers. Furthermore, we identified a phenomenon of "Implementation Laziness," where open-weights models generated 60% fewer Logical Lines of Code (LLOC) than their proprietary counterparts, effectively omitting complex business logic to satisfy token constraints. These findings suggest that without automated architectural linting, utilizing smaller open-weights models for system scaffolding accelerates the accumulation of structural technical debt.
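The AST-based check described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's tooling: the function names and the `FORBIDDEN_PREFIXES` tuple are hypothetical, and the package layout (`infrastructure`, `domain.ports`) is an assumption made only for this example. The sketch parses one Domain-layer source file with the standard-library `ast` module and reports any import that points at an Infrastructure or external concrete dependency.

```python
import ast

# Hypothetical prefixes treated as off-limits for Domain-layer code.
# The paper does not publish its package layout; these are assumptions.
FORBIDDEN_PREFIXES = ("infrastructure", "sqlalchemy", "redis")


def imported_modules(source: str) -> set[str]:
    """Return the dotted names of every module imported by `source`."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b, c` yields one alias per imported module.
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # `from a.b import c` records the module `a.b`.
            modules.add(node.module)
    return modules


def domain_violations(source: str) -> list[str]:
    """List imports in a Domain-layer file that match a forbidden prefix."""
    return sorted(
        module
        for module in imported_modules(source)
        if any(
            module == prefix or module.startswith(prefix + ".")
            for prefix in FORBIDDEN_PREFIXES
        )
    )


if __name__ == "__main__":
    domain_file = (
        "from infrastructure.db import SqlBookRepo\n"
        "from domain.ports import BookRepository\n"
    )
    print(domain_violations(domain_file))  # ['infrastructure.db']
```

Matching on module prefixes keeps the rule readable, but it only catches static imports; dynamic imports or dependencies passed in at runtime would need a different check.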

Paper Structure

This paper contains 22 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: The Automated Evaluation Pipeline. Code is generated via API, parsed into Abstract Syntax Trees (AST) to detect import statements, and graded against the Hexagonal Architecture constraints.
  • Figure 2: Ideal vs. Observed Architecture. (A) The requested Hexagonal Architecture, where dependencies point inward via Ports. (B) The observed "Hallucinated Coupling" in Llama 3, where the Domain Layer illegally imports a concrete Infrastructure dependency.
  • Figure 3: Distribution of Implementation Density (LLOC). The proprietary models (GPT-5.1, Claude 4.5) show high density, while the Llama 3 cluster demonstrates Implementation Laziness, yielding trivial or incomplete implementations of the requested caching mechanism.
  • Figure 4: Architectural Violation Rate (AVR), defined as the percentage of runs where the Domain Layer illegally imported an Infrastructure or external concrete dependency. Llama 3 exhibits critical failure rates due to "Hallucinated Coupling."
  • Figure 5: The Maintainability-Completeness Tradeoff. This scatter plot demonstrates the inverse correlation in Llama 3 (Red), where high MI is a statistical artifact of low LLOC (Zone of Laziness), suggesting the model avoided complex logic. The GPT-5.1 cluster (Blue) represents the "Robust" zone (high volume, acceptable MI).
  • ...and 1 more figure
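The captions for Figures 3 and 4 rest on two per-run measurements, LLOC and AVR. The sketch below shows one way they could be computed in Python. It is an illustration under stated assumptions, not the paper's implementation: `approx_lloc` approximates LLOC by counting statement nodes in the AST, which may differ from the counter the author used, and `violation_rate` assumes each run has already been flagged as violating or not. Both function names are hypothetical.

```python
import ast


def approx_lloc(source: str) -> int:
    """Approximate LLOC as the number of statement nodes in the AST.

    Assumption: one logical line per `ast.stmt` node. Dedicated metric
    tools may count differently (for example around docstrings).
    """
    return sum(isinstance(node, ast.stmt) for node in ast.walk(ast.parse(source)))


def violation_rate(run_has_violation: list[bool]) -> float:
    """AVR: the percentage of runs flagged with an illegal Domain import."""
    if not run_has_violation:
        raise ValueError("at least one run is required")
    return 100.0 * sum(run_has_violation) / len(run_has_violation)


if __name__ == "__main__":
    sample = "def f(x):\n    if x:\n        return 1\n    return 0\n"
    print(approx_lloc(sample))  # 4

    # Five hypothetical runs, four of them flagged as violations.
    print(violation_rate([True, True, True, True, False]))  # 80.0
```

Because MI tends to rise as code volume falls, reporting LLOC next to MI for every run makes it easier to spot the case Figure 5 describes, where a high MI score reflects missing logic rather than clean design.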