PELLI: Framework to effectively integrate LLMs for quality software generation
Rasmus Krebs, Somnath Mazumdar
TL;DR
PELLI introduces an iterative, analysis-driven framework for integrating LLMs into quality software generation, pairing LLM-generated code with domain-expert refinement. The study evaluates five state-of-the-art LLMs across three Python-based application domains, using stratified prompts and static analysis to measure maintainability, performance, and reliability. GPT-4T and Gemini perform slightly better than the other models; prompt design influences code quality, and scores vary by domain. The framework and results offer practical guidance for aligning LLM output with real-world software standards and for deploying LLM-assisted development, and they point to future extensions to other languages and to security considerations.
Abstract
Recent studies show that LLMs, when appropriately prompted and configured, produce mixed results that nevertheless often meet or exceed baseline performance. However, these comparisons have two primary issues. First, they mostly considered only reliability as a comparison metric. Second, they selected only a few LLMs (such as Codex and ChatGPT) for comparison. This paper proposes a comprehensive code quality assessment framework called Programmatic Excellence via LLM Iteration (PELLI). PELLI is an iterative, analysis-based process that upholds high-quality code changes. We extend the state of the art with a comprehensive evaluation that generates quantitative metrics for three primary nonfunctional requirements (maintainability, performance, and reliability) across five popular LLMs. To demonstrate PELLI's applicability, we selected three application domains and followed Python coding standards. By following this framework, practitioners can achieve harmonious integration between LLMs and human developers and fully realize their potential. PELLI can serve as a practical guide for developers who want to leverage LLMs while adhering to recognized quality standards. The study's outcomes help advance LLM technologies in real-world applications by giving stakeholders a clear understanding of where these LLMs excel and where they require further refinement. Overall, across the three nonfunctional requirements, GPT-4T and Gemini performed slightly better than the other LLMs. We also found that prompt design can influence overall code quality. In addition, each application domain showed both high and low scores across the metrics, and scores varied even within the same metric across different prompts.
