A Causal Perspective on Measuring, Explaining and Mitigating Smells in LLM-Generated Code
Alejandro Velasco, Daniel Rodriguez-Cardenas, Dipin Khati, David N. Palacio, Luftar Rahman Alif, Denys Poshyvanyk
TL;DR
This work treats code smells in LLM-generated code as a tractable causal problem and introduces Propensity Smelly Score (PSC), a next-token-based probabilistic measure of smell propensity. It validates PSC’s robustness under semantic-preserving transformations, then uses a structural causal model to quantify how generation strategy, model architecture, model size, and prompt formulation influence smell propensity, and demonstrates effective prompt-based mitigation. A user study shows that PSC helps developers interpret model behavior and judge code quality, supporting practical adoption of quality-aware evaluation. The study also provides a CodeSmellData 2.0-based benchmark and a methodological framework for explaining and mitigating smells in LLM-driven code production.
Abstract
Recent advances in large language models (LLMs) have accelerated their adoption in software engineering contexts. However, concerns persist about the structural quality of the code they produce. In particular, LLMs often replicate poor coding practices, introducing code smells (i.e., patterns that hinder readability, maintainability, or design integrity). Although prior research has examined the detection or repair of smells, we still lack a clear understanding of how and when these issues emerge in generated code. This paper addresses this gap by systematically measuring, explaining and mitigating smell propensity in LLM-generated code. We build on the Propensity Smelly Score (PSC), a probabilistic metric that estimates the likelihood of generating particular smell types, and establish its robustness as a signal of structural quality. Using PSC as an instrument for causal analysis, we identify how generation strategy, model size, model architecture and prompt formulation shape the structural properties of generated code. Our findings show that prompt design and architectural choices play a decisive role in smell propensity and motivate practical mitigation strategies that reduce its occurrence. A user study further demonstrates that PSC helps developers interpret model behavior and assess code quality, providing evidence that smell propensity signals can support human judgement. Taken together, our work lays the groundwork for integrating quality-aware assessments into the evaluation and deployment of LLMs for code.
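To make the idea of a next-token-based propensity score concrete, the following is a minimal sketch and not the paper's definition of PSC. It assumes that a smell occupies a contiguous token span in the generated code, that per-token log-probabilities from the model are available, and that the score is the mean token probability over that span. The function name `smell_propensity`, the span convention, and the choice of the mean as the aggregate are all hypothetical.

```python
import math
from typing import Sequence, Tuple


def smell_propensity(token_logprobs: Sequence[float], span: Tuple[int, int]) -> float:
    """Mean next-token probability over the tokens of a smell span.

    token_logprobs: log-probability the model assigned to each generated token.
    span: (start, end) token indices of the smelly fragment, end exclusive.
    Both the span convention and the mean aggregation are assumptions made
    for illustration only.
    """
    start, end = span
    if not (0 <= start < end <= len(token_logprobs)):
        raise ValueError("span must be a non-empty range inside the sequence")
    # Convert log-probabilities back to probabilities before averaging.
    probs = [math.exp(lp) for lp in token_logprobs[start:end]]
    return sum(probs) / len(probs)


# Toy sequence of four tokens; the hypothetical smell covers tokens 1 and 2.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.75), math.log(1.0)]
score = smell_propensity(logprobs, (1, 3))
```

Under these assumptions, a score close to 1 would indicate that the model was highly confident while emitting the smelly tokens, and a score close to 0 that it produced them with low confidence.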
