Code Linting using Language Models
Darren Holden, Nafiseh Kahani
TL;DR
This work investigates using a CodeBERT-based language model to build a language-agnostic code linter that overcomes the language-specificity and performance-accuracy trade-offs of traditional static analyzers. By constructing a large Java-focused dataset (169,494 methods, 108,182 issues, 118 issue types) and training two classifiers (a binary issue detector and a multi-label issue-type predictor), the approach achieves 84.9% accuracy for binary detection and 83.6% for multi-label classification, while offering substantial speed advantages over conventional linters. Key findings show that input formatting, dataset balance, and pre-training exposure significantly influence performance, with notable gains when focusing on common issue types and leveraging CodeSearchNet pre-training. The study also identifies limitations in detecting rare issues and generalizing to unseen project domains, pointing to future work on richer context, synthetic data to balance labels, and scaling to larger models. Overall, the results suggest that language models can provide competitive, faster code linting across many issue types and, potentially, across languages, enabling more efficient CI workflows.
Abstract
Code linters play a crucial role in developing high-quality software systems by detecting potential problems (e.g., memory leaks) in source code. Despite their benefits, code linters are often language-specific, focused on certain types of issues, and prone to false positives in the interest of speed. This paper investigates whether large language models can be used to develop a more versatile code linter. Such a linter is expected to be language-independent, cover a variety of issue types, and maintain high speed. To achieve this, we collected a large dataset of code snippets and their associated issues. We then selected a language model and trained two classifiers on the collected dataset. The first is a binary classifier that detects whether the code has issues, and the second is a multi-label classifier that identifies the types of issues. Through extensive experiments, we show that the resulting language model-based linter achieves an accuracy of 84.9% for the binary classifier and 83.6% for the multi-label classifier.
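The difference between the two classifiers can be illustrated with a minimal sketch of how their outputs might be decoded. This is not the authors' implementation: it assumes each classifier emits raw logits, that a sigmoid with a 0.5 threshold is used for decisions, and the issue-type names below are hypothetical placeholders (the actual dataset has 118 types).

```python
import math
from typing import List

# Hypothetical issue-type names, for illustration only.
ISSUE_TYPES = ["unused_variable", "resource_leak", "null_dereference"]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def has_issue(logit: float, threshold: float = 0.5) -> bool:
    """Binary classifier: a single score decides whether the snippet has any issue."""
    return sigmoid(logit) >= threshold


def issue_types(logits: List[float], threshold: float = 0.5) -> List[str]:
    """Multi-label classifier: each issue type gets its own independent
    sigmoid, so zero, one, or several types can be reported at once."""
    return [
        name
        for name, z in zip(ISSUE_TYPES, logits)
        if sigmoid(z) >= threshold
    ]


if __name__ == "__main__":
    # Assumed example logits, not model outputs from the paper.
    print(has_issue(2.0))
    print(issue_types([3.0, -1.0, 0.2]))
```

The key design point is that the multi-label output is not a softmax over issue types: because each type is thresholded independently, one method can be flagged with several issue types simultaneously.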
