Combining Large Language Models with Static Analyzers for Code Review Generation

Imen Jaoua, Oussama Ben Sghaier, Houari Sahraoui

TL;DR

The paper tackles automated code review by proposing a hybrid framework that fuses knowledge-based static analysis with learning-based LLMs at three pipeline stages: data preparation (DAT), inference (RAG), and post-processing (NCO). In experiments on a Java-focused dataset, with PMD and Checkstyle as the static analyzers and a CodeLlama-7b baseline fine-tuned with QLoRA, the authors show that RAG yields the best accuracy and that DAT and RAG provide substantial coverage improvements, while NCO offers moderate gains. An LLM-based evaluator (Llama3-70B) shows substantial alignment with human judgments, which supports scalable evaluation across the dataset. The results point to the potential of integrating static-analysis knowledge with LLMs to produce more precise and comprehensive code-review comments, with implications for practical toolchains and future multi-language, multi-tool extensions.

Abstract

Code review is a crucial but often complex, subjective, and time-consuming activity in software development. Over the past decades, significant efforts have been made to automate this process. Early approaches focused on knowledge-based systems (KBS) that apply rule-based mechanisms to detect code issues, providing precise feedback but struggling with complex, context-dependent cases. More recent work has shifted toward fine-tuning pre-trained language models for code review, enabling broader issue coverage but often at the expense of precision. In this paper, we propose a hybrid approach that combines the strengths of KBS and learning-based systems (LBS) to generate high-quality, comprehensive code reviews. Our method integrates knowledge at three distinct stages of the language model pipeline: during data preparation (Data-Augmented Training, DAT), at inference (Retrieval-Augmented Generation, RAG), and after inference (Naive Concatenation of Outputs, NCO). We empirically evaluate our combination strategies against standalone KBS and LBS fine-tuned on a real-world dataset. Our results show that these hybrid strategies enhance the relevance, completeness, and overall quality of review comments, effectively bridging the gap between rule-based tools and deep learning models.
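
To make the three integration points in the abstract concrete, here is a minimal Python sketch of where static-analyzer output could enter the pipeline in each strategy. It is an illustration under stated assumptions, not the authors' implementation: the names `augment_training_data`, `review_with_rag`, `review_with_nco`, the `TrainingExample` type, and the prompt layout are all hypothetical, and the analyzer and model are plain callables standing in for tools such as PMD or Checkstyle and a fine-tuned LLM.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins (assumptions, not from the paper): an analyzer maps
# code to a list of rule-based findings; a model maps a prompt to one review.
Analyzer = Callable[[str], List[str]]
Model = Callable[[str], str]


@dataclass
class TrainingExample:
    code: str
    review: str


def augment_training_data(examples: List[TrainingExample],
                          analyzer: Analyzer) -> List[TrainingExample]:
    """DAT (data preparation): add analyzer findings as extra training pairs."""
    augmented = list(examples)
    for ex in examples:
        for finding in analyzer(ex.code):
            augmented.append(TrainingExample(ex.code, finding))
    return augmented


def review_with_rag(code: str, analyzer: Analyzer, model: Model) -> str:
    """RAG (inference): place analyzer findings in the model's prompt."""
    findings = "\n".join(analyzer(code))
    # The prompt layout below is an assumption chosen for illustration.
    prompt = (f"Static analysis findings:\n{findings}\n\n"
              f"Code:\n{code}\n\nReview:")
    return model(prompt)


def review_with_nco(code: str, analyzer: Analyzer, model: Model) -> str:
    """NCO (after inference): append analyzer findings to the model's review."""
    return "\n".join([model(code)] + analyzer(code))
```

In this sketch the only difference between the strategies is where the analyzer is called: before fine-tuning (DAT), while building the prompt (RAG), or after the model has answered (NCO).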

Paper Structure

This paper contains 27 sections, 11 figures, 1 table.

Figures (11)

  • Figure 1: Different strategies to combine learning and knowledge-based systems
  • Figure 2: Proposed strategies to combine LBS and KBS
  • Figure 3: Dataset augmentation pipeline
  • Figure 4: Judgment of review comments using Llama3-70B
  • Figure 5: Distribution of LBS and KBS Reviews in Our Dataset
  • ...and 6 more figures