Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3
Ahmed R. Sadik, Siddhata Govind
TL;DR
The study addresses the problem of selecting an effective LLM for cross-language code smell detection by benchmarking OpenAI GPT-4.0 and DeepSeek-V3 against SonarQube on a multilingual, ground-truth dataset spanning Java, Python, JavaScript, and C++. It adopts a structured evaluation with standardized prompts and examines model-level, category-level, type-level, and language-specific performance, using Precision, Recall, and F1 as metrics, with $Precision = \frac{TP}{TP+FP}$, $Recall = \frac{TP}{TP+FN}$, and $F1 = 2 \times \frac{Precision \times Recall}{Precision+Recall}$. The results show that GPT-4.0 achieves higher precision (0.79) than DeepSeek-V3 (0.42), but both models suffer from relatively low recall (0.41 and 0.31, respectively), and DeepSeek-V3 incurs more false positives. A cost analysis contrasts token-based pricing for GPT-4.0 with fixed, complexity-based pricing for DeepSeek-V3, highlighting the trade-off between accuracy and affordability, and a SonarQube comparison suggests a hybrid approach combining deterministic static analysis with semantic, model-driven detection. Overall, the work provides actionable guidance for practitioners on deploying cost-effective, accurate code smell detection and points to avenues for improving recall via prompt engineering, few-shot learning, and integration with traditional static analysis in real-world pipelines.
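The three metric definitions above can be made concrete with a minimal Python sketch. The `Counts` container, the function names, and the example counts are illustrative assumptions; they are not taken from the study's code or dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Detection counts for one model (hypothetical container, not from the paper)."""
    tp: int  # smells reported by the model that are in the ground truth
    fp: int  # smells reported by the model that are not in the ground truth
    fn: int  # ground-truth smells the model failed to report


def precision(c: Counts) -> float:
    """TP / (TP + FP); defined as 0.0 when the model reports nothing."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    """TP / (TP + FN); defined as 0.0 when the ground truth is empty."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts only; they are not the study's measurements.
example = Counts(tp=8, fp=2, fn=12)
p, r = precision(example), recall(example)
print(f"precision={p:.2f} recall={r:.2f} f1={f1(p, r):.2f}")
```

Returning 0.0 for an empty denominator is a convention chosen for this sketch; the source does not say how undefined cases are handled.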
Abstract
Determining the most effective Large Language Model (LLM) for code smell detection is a complex challenge. This study introduces a structured methodology and evaluation matrix to address it, using a curated dataset of code samples consistently annotated with known smells. The dataset spans four prominent programming languages (Java, Python, JavaScript, and C++), allowing for cross-language comparison. We benchmark two state-of-the-art LLMs, OpenAI GPT-4.0 and DeepSeek-V3, using precision, recall, and F1 score as evaluation metrics. Our analysis covers three levels of detail: overall performance, category-level performance, and individual code smell type performance. Additionally, we explore cost-effectiveness by comparing the token-based detection approach of GPT-4.0 with the pattern-matching techniques employed by DeepSeek-V3. The study also includes a cost analysis relative to traditional static analysis tools such as SonarQube. The findings offer valuable guidance for practitioners in selecting an efficient, cost-effective solution for automated code smell detection.
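One way to picture the three levels of analysis is to group the same per-finding outcomes by different keys. The sketch below is a hypothetical illustration: the `Finding` record, the category and smell names, and the sample data are assumptions made for the example, not the study's schema or results.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Finding(NamedTuple):
    """One evaluated detection (hypothetical record layout)."""
    language: str
    category: str
    smell_type: str
    outcome: str  # "TP", "FP", or "FN"


def aggregate(findings, key):
    """Count TP/FP/FN outcomes per group, where `key` maps a finding to its group."""
    buckets = defaultdict(Counter)
    for f in findings:
        buckets[key(f)][f.outcome] += 1
    return buckets


def scores(c):
    """Return (precision, recall, F1) for one group's outcome counts."""
    tp, fp, fn = c["TP"], c["FP"], c["FN"]
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Made-up sample data for illustration only.
findings = [
    Finding("Java", "Bloaters", "Long Method", "TP"),
    Finding("Java", "Bloaters", "Long Method", "FN"),
    Finding("Python", "Bloaters", "Large Class", "TP"),
    Finding("Python", "Couplers", "Feature Envy", "FP"),
    Finding("C++", "Couplers", "Feature Envy", "TP"),
]

levels = {
    "overall": lambda f: "all",
    "category": lambda f: f.category,
    "type": lambda f: f.smell_type,
}
for level, key in levels.items():
    for group, counts in aggregate(findings, key).items():
        p, r, f = scores(counts)
        print(f"{level:<8} {group:<12} P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Grouping by `f.language` in the same way would give a per-language breakdown for the four languages in the dataset.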
