Code Linting using Language Models

Darren Holden, Nafiseh Kahani

TL;DR

This work investigates using a CodeBERT-based language model to build a language-agnostic code linter that avoids the language specificity and speed-versus-accuracy trade-offs of traditional static analyzers. The authors construct a large Java-focused dataset (169,494 methods, 108,182 issues, 118 issue types) and train two classifiers, a binary issue detector and a multi-label issue-type predictor, which reach 84.9% accuracy for binary detection and 83.6% for multi-label classification while running substantially faster than conventional linters. Key findings show that input formatting, dataset balance, and pre-training exposure significantly influence performance, with notable gains when focusing on common issue types and leveraging CodeSearchNet pre-training. The study also identifies limitations in detecting rare issues and generalizing to unseen project domains, pointing to future work on richer context, synthetic data to balance labels, and scaling to larger models. Overall, the results suggest that large language models can yield competitive, faster code linting across multiple languages and issue types, enabling more efficient CI workflows.

Abstract

Code linters play a crucial role in developing high-quality software systems by detecting potential problems (e.g., memory leaks) in the source code of systems. Despite their benefits, code linters are often language-specific, focused on certain types of issues, and prone to false positives in the interest of speed. This paper investigates whether large language models can be used to develop a more versatile code linter. Such a linter is expected to be language-independent, cover a variety of issue types, and maintain high speed. To achieve this, we collected a large dataset of code snippets and their associated issues. We then selected a language model and trained two classifiers based on the collected datasets. The first is a binary classifier that detects if the code has issues, and the second is a multi-label classifier that identifies the types of issues. Through extensive experimental studies, we demonstrated that the developed large language model-based linter can achieve an accuracy of 84.9% for the binary classifier and 83.6% for the multi-label classifier.
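
To make the difference between the two classifiers concrete, the following is a minimal sketch of how raw model scores (logits) could be turned into predictions. It is an illustration under stated assumptions, not the authors' implementation: the sigmoid-plus-threshold decision rule, the 0.5 threshold, the function names, and the three issue-type names are all hypothetical placeholders (the paper's dataset covers 118 issue types).

```python
import math

# Hypothetical issue-type labels, used for illustration only.
ISSUE_TYPES = ["NULL_DEREFERENCE", "RESOURCE_LEAK", "THREAD_SAFETY_VIOLATION"]


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def has_issue(logit: float, threshold: float = 0.5) -> bool:
    """Binary classifier head: one logit answers 'does this code have any issue?'."""
    return sigmoid(logit) >= threshold


def issue_types(logits: list[float], threshold: float = 0.5) -> list[str]:
    """Multi-label head: one independent logit per issue type.

    Each label is thresholded on its own, so a method can receive
    zero, one, or several issue types at once.
    """
    return [
        name
        for name, logit in zip(ISSUE_TYPES, logits)
        if sigmoid(logit) >= threshold
    ]


if __name__ == "__main__":
    print(has_issue(2.0))
    print(issue_types([1.5, -2.0, 0.3]))
```

The design point the sketch highlights is that the multi-label head scores each issue type independently rather than picking a single best class, which is what lets one method be flagged with several issue types.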

Paper Structure

This paper contains 32 sections, 5 equations, 4 figures, 12 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of the approach utilized in this study
  • Figure 2: A sample target method with a potential NullPointerException (as the input object may be null)
  • Figure 3: Examples of several input formats, applied to the sample target method shown in Figure 2.
  • Figure 4: Time taken by each part of our approach, for each component, on the Infer and SpotBugs projects (left and right of the plot, respectively). Outliers are excluded from this plot.