Code Review Automation Via Multi-task Federated LLM -- An Empirical Study
Jahnavi Kumar, Sridhar Chimalakonda
TL;DR
This study tackles code review automation by learning three interrelated tasks, Review Necessity Prediction (T1), Review Comment Generation (T2), and Code Refinement (T3), in a privacy-preserving federated learning setup using LoRA-tuned LLaMA-3. It evaluates five training strategies spanning sequential, parallel, and cumulative multi-task configurations, comparing federated models against a central baseline on the CodeReviewer dataset across nine languages. The key findings show that sequential multi-task FedLLMs suffer catastrophic forgetting, while cumulative fine-tuning approaches, particularly FedCFT-reg, achieve performance competitive with or superior to single-task models, with notable gains for low-data clients on T1 and T3 and modest gains on T2. The work highlights both the potential and the challenges of multi-task FedLLMs in software engineering, pointing to continual-learning techniques and stronger privacy-preserving methods as important directions for future research. Overall, the results support the feasibility of privacy-aware collaborative fine-tuning for SE tasks, enabling multi-task code review automation across organizations without exposing proprietary code.
Abstract
Code review is a crucial process before deploying code to production, as it validates the code, provides suggestions for improvements, and identifies errors such as missed edge cases. In projects with regular production releases, the effort required for peer code reviews remains high. Consequently, there has been significant interest from software engineering (SE) researchers in automating the code review process. Previous research on code review automation has typically approached the task as three independent sub-tasks: review necessity prediction, review comment generation, and code refinement. Our study attempts to (i) leverage the relationships between the sub-tasks of code review automation by developing a multi-task model that addresses all tasks in an integrated manner, and (ii) increase model robustness on unseen data via collaborative large language model (LLM) modeling, while retaining the proprietary nature of code, by using federated learning (FL). The study explores five simple techniques for multi-task training: two sequential methods, one parallel method, and two cumulative methods. The results indicate that sequentially training a federated LLM (FedLLM) for our code review multi-task use case is less efficient in terms of time, computation, and performance metrics than training separate models for each task, because sequential training exhibits catastrophic forgetting. Cumulative fine-tuning for multi-task training, in contrast, performs better than training models for individual tasks. This study highlights the need for research focused on effective fine-tuning of multi-task FedLLMs for SE tasks.
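To make the federated setting concrete, the following is a minimal sketch of one server-side aggregation round, assuming a FedAvg-style weighted average over client adapter parameters. The abstract does not state the aggregation rule used in the study, so the choice of FedAvg, the function name `fed_avg`, and the parameter name `lora_A` are illustrative assumptions rather than the authors' implementation. The point it shows is that only parameter updates leave each client, never the proprietary code.

```python
from typing import Dict, List

# A client's trainable parameters, flattened per named tensor.
# (Hypothetical representation; real adapters would be framework tensors.)
Weights = Dict[str, List[float]]


def fed_avg(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Average client parameters, weighting each client by its sample count.

    Assumption: FedAvg-style aggregation; the source does not specify the rule.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client, and at least one client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    aggregated: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        aggregated[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return aggregated


# Two hypothetical clients: one with 1 local sample, one with 3.
clients = [{"lora_A": [0.0, 4.0]}, {"lora_A": [4.0, 0.0]}]
global_weights = fed_avg(clients, client_sizes=[1, 3])
```

In this sketch the larger client pulls the global parameters toward its own values, which is one reason low-data clients can benefit from collaboration.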
