Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3

Ahmed R. Sadik, Siddhata Govind

TL;DR

The study addresses the problem of selecting an effective LLM for cross-language code smell detection by benchmarking OpenAI GPT-4.0 and DeepSeek-V3 against SonarQube on a multilingual, ground-truth dataset spanning Java, Python, JavaScript, and C++. It adopts a structured evaluation with standardized prompts and examines model-level, category-level, type-level, and language-specific performance, using Precision, Recall, and F1 as metrics, with $Precision = \frac{TP}{TP+FP}$, $Recall = \frac{TP}{TP+FN}$, and $F1 = 2 \times \frac{Precision \times Recall}{Precision+Recall}$. The results show GPT-4.0 achieves higher precision (0.79) than DeepSeek-V3 (0.42) but both models suffer from relatively low recall (0.41 and 0.31, respectively), while DeepSeek incurs more false positives. A cost analysis contrasts token-based pricing for GPT-4.0 with fixed, complexity-based pricing for DeepSeek-V3, highlighting the trade-off between accuracy and affordability, and a SonarQube comparison suggests a hybrid approach combining deterministic static analysis with semantic, model-driven detection. Overall, the work provides actionable guidance for practitioners on deploying cost-effective, accurate code smell detection and points to avenues for improving recall via prompt engineering, few-shot learning, and integration with traditional static analysis in real-world pipelines.
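As a minimal sketch of how the three metrics above follow from raw detection counts, the Python snippet below computes Precision, Recall, and F1 from true positives, false positives, and false negatives. The `Counts` class, the function names, and the example counts are hypothetical illustrations, not values or code from the study; returning 0.0 for an empty denominator is likewise an assumption made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical container for one model's detection outcomes."""
    tp: int  # smells reported by the model that are in the ground truth
    fp: int  # smells reported by the model that are not in the ground truth
    fn: int  # ground-truth smells the model missed


def precision(c: Counts) -> float:
    # Precision = TP / (TP + FP); 0.0 when nothing was reported (assumption).
    reported = c.tp + c.fp
    return c.tp / reported if reported else 0.0


def recall(c: Counts) -> float:
    # Recall = TP / (TP + FN); 0.0 when the ground truth is empty (assumption).
    actual = c.tp + c.fn
    return c.tp / actual if actual else 0.0


def f1(c: Counts) -> float:
    # F1 = 2 * Precision * Recall / (Precision + Recall)
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration only: 8 correct detections,
    # 2 spurious detections, 12 missed smells.
    example = Counts(tp=8, fp=2, fn=12)
    print(f"precision={precision(example):.2f}")
    print(f"recall={recall(example):.2f}")
    print(f"f1={f1(example):.2f}")
```

With counts like these, a detector can be precise (few spurious reports) while still missing most smells, which is the high-precision, low-recall pattern the summary describes.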

Abstract

Determining the most effective Large Language Model for code smell detection presents a complex challenge. This study introduces a structured methodology and evaluation matrix to tackle this issue, leveraging a curated dataset of code samples consistently annotated with known smells. The dataset spans four prominent programming languages (Java, Python, JavaScript, and C++), allowing for cross-language comparison. We benchmark two state-of-the-art LLMs, OpenAI GPT-4.0 and DeepSeek-V3, using precision, recall, and F1 score as evaluation metrics. Our analysis covers three levels of detail: overall performance, category-level performance, and individual code smell type performance. Additionally, we explore cost-effectiveness by comparing the token-based detection approach of GPT-4.0 with the pattern-matching techniques employed by DeepSeek-V3. The study also includes a cost analysis relative to traditional static analysis tools such as SonarQube. The findings offer valuable guidance for practitioners in selecting an efficient, cost-effective solution for automated code smell detection.

Paper Structure

This paper contains 18 sections, 1 equation, 10 figures, 9 tables.

Figures (10)

  • Figure 1: LLM-based collaboration in software engineering.
  • Figure 2: Code smells taxonomy.
  • Figure 3: Generic class diagram representing the generic system implemented in Java, JavaScript, Python, and C++.
  • Figure 4: Heatmaps representing various metrics across Java, JavaScript, Python, and C++ implementations. (a) Lines of Code (LOC), (b) Number of Methods, (c) Number of Attributes, (d) Code Smells Distribution.
  • Figure 5: Performance comparison of GPT-4.0 and DeepSeek-V3 in code smell detection.
  • ...and 5 more figures