Memory-Efficient Large Language Models for Program Repair with Semantic-Guided Patch Generation

Thanh Le-Cong, Bach Le, Toby Murray

TL;DR

This work addresses a key memory bottleneck in large language model–based automated program repair (APR): GPU memory usage grows as the beam size increases. It shows that conventional memory-reduction tactics, namely quantization and sequential beam search, fail to prevent out-of-memory crashes at larger beam sizes, motivating a new approach that combines LLM patch generation with semantic feedback. The proposed FLAMES framework uses a semantic-guided best-first search (PG-TD) to steer decoding, reducing memory usage by up to 83% without compromising time efficiency and producing more correct patches on Defects4J, HumanEval-Java, and TransformedD4J. Overall, FLAMES demonstrates strong repair effectiveness, better memory efficiency, and generalizability across datasets, pointing to a practical path toward scalable, memory-conscious LLM-based APR.
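To see why the beam size, rather than the weights alone, drives memory growth, here is a rough back-of-envelope estimate. It is a sketch under stated assumptions, not the paper's theoretical analysis: the model shape (`SHAPE_7B`, roughly a 7B Llama-style decoder with standard multi-head attention), the 2,048-token sequence length, the fp16 key/value cache, and all helper names are hypothetical, and activations and framework overhead are ignored.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder-only transformer shape (standard multi-head attention)."""
    n_params: float  # total parameter count
    n_layers: int    # number of transformer layers
    hidden: int      # hidden size (= n_heads * head_dim)


def weight_bytes(shape: ModelShape, bytes_per_param: float) -> float:
    # Weights are loaded once; their size does not depend on the beam size.
    return shape.n_params * bytes_per_param


def kv_cache_bytes(shape: ModelShape, seq_len: int, beams: int,
                   bytes_per_elem: float = 2.0) -> float:
    # Every layer caches one key and one value vector of width `hidden` for
    # each token of each beam, so the cache grows linearly with the beam size.
    return 2 * shape.n_layers * shape.hidden * seq_len * beams * bytes_per_elem


def estimate_gib(shape: ModelShape, seq_len: int, beams: int,
                 bytes_per_param: float) -> float:
    # Ignores activations, CUDA context, and framework overhead,
    # so real usage is higher than this estimate.
    total = weight_bytes(shape, bytes_per_param) + kv_cache_bytes(shape, seq_len, beams)
    return total / GIB


# Assumed shape, roughly that of a 7B Llama-style model; not taken from the paper.
SHAPE_7B = ModelShape(n_params=7e9, n_layers=32, hidden=4096)

if __name__ == "__main__":
    for beams in (1, 10, 25, 50):
        fp16 = estimate_gib(SHAPE_7B, seq_len=2048, beams=beams, bytes_per_param=2.0)
        int4 = estimate_gib(SHAPE_7B, seq_len=2048, beams=beams, bytes_per_param=0.5)
        print(f"beams={beams:3d}  fp16 weights: {fp16:6.1f} GiB  4-bit weights: {int4:6.1f} GiB")
```

With these assumed numbers, each additional beam at 2,048 tokens adds exactly 1 GiB of cache, while quantizing the weights from 16 to 4 bits saves roughly 10 GiB once; beyond a few dozen beams the cache dominates in either case. The sketch does not model sequential beam search.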

Abstract

In this paper, we first show that increasing the beam size, even for small LLMs (1B–7B parameters), requires extensive GPU memory, with memory overloads causing recurring crashes in up to 80% of cases in LLM-based APR. Seemingly simple solutions to reduce memory consumption are (1) to quantize LLMs, i.e., to convert an LLM's weights from high-precision values to lower-precision ones, and (2) to make beam search sequential, i.e., to forward each beam through the model one at a time and then concatenate the results into a single output. However, we show through both theoretical analysis and experiments that these approaches still fail. To address this, we introduce FLAMES, a novel LLM-based APR technique that employs semantic-guided patch generation to improve both repair effectiveness and memory efficiency. Unlike conventional methods that rely on beam search, FLAMES uses greedy decoding to save memory while steering the search toward more promising repair candidates via a semantic-guided best-first search algorithm. At each decoding step, FLAMES uses semantic feedback from test validation, such as the number of passing and failing test cases, to select the most promising token to explore further. Our empirical evaluation on Defects4J shows that FLAMES reduces memory consumption by up to 83% compared to LLM-based APR without compromising time efficiency. Moreover, FLAMES correctly fixes 133 bugs on Defects4J, 10 more than the best baseline. These improvements also generalize to the HumanEval-Java and TransformedD4J datasets, where FLAMES generates 12% and 36.5% more correct patches, respectively, than the best baseline.
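The idea of steering greedy decoding with test feedback can be illustrated with a toy sketch. Everything here is hypothetical: `toy_model` stands in for the LLM, `run_tests` stands in for test validation of a three-token "patch", and a prefix's priority is simply the pass rate of its greedy completion. It is a generic best-first search with greedy rollouts, not FLAMES's actual algorithm or scoring; note that each rollout decodes a single sequence, so no beams are held in memory at once.

```python
import heapq
import itertools
import operator
from typing import Callable, Dict, List, Optional, Tuple

Tokens = Tuple[str, ...]
Model = Callable[[Tokens], Dict[str, float]]
EOS = "<eos>"
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
# Hypothetical test suite for a buggy `add(a, b)`: (a, b, expected result).
TESTS = [(2, 3, 5), (0, 0, 0), (1, 4, 5)]


def toy_model(prefix: Tokens) -> Dict[str, float]:
    """Hypothetical stand-in for an LLM: next-token probabilities for a prefix."""
    if len(prefix) == 0:
        return {"a": 0.9, "b": 0.1}
    if len(prefix) == 1:
        return {"-": 0.5, "+": 0.3, "*": 0.2}  # the most likely operator is wrong
    if len(prefix) == 2:
        return {"b": 0.8, "a": 0.2}
    return {EOS: 1.0}


def run_tests(patch: Tokens, tests: List[Tuple[int, int, int]]) -> float:
    """Hypothetical semantic feedback: fraction of test cases the patch passes."""
    if len(patch) != 3:
        return 0.0  # incomplete or malformed patches fail every test
    left, op, right = patch
    if op not in OPS or {left, right} - {"a", "b"}:
        return 0.0
    passed = 0
    for a, b, expected in tests:
        env = {"a": a, "b": b}
        if OPS[op](env[left], env[right]) == expected:
            passed += 1
    return passed / len(tests)


def greedy_complete(model: Model, prefix: Tokens, max_len: int) -> Tokens:
    """Extend `prefix` one argmax token at a time: a single sequence, no beams."""
    seq = prefix
    while len(seq) < max_len:
        probs = model(seq)
        token = max(probs, key=probs.get)
        if token == EOS:
            break
        seq = seq + (token,)
    return seq


def semantic_best_first_search(model: Model, tests: List[Tuple[int, int, int]],
                               top_k: int = 2, max_len: int = 4, budget: int = 20
                               ) -> Tuple[Optional[Tokens], float, int]:
    """Best-first search over prefixes, prioritized by the test pass rate of each
    prefix's greedy completion. Returns (best patch, its score, test runs used)."""
    tie = itertools.count()  # tie-breaker so heapq never compares prefixes
    frontier = [(0.0, next(tie), ())]  # entries are (-score, tie, prefix)
    cache: Dict[Tokens, float] = {}
    best_patch: Optional[Tokens] = None
    best_score = -1.0
    evaluations = 0
    while frontier and evaluations < budget:
        _, _, prefix = heapq.heappop(frontier)
        probs = model(prefix)
        for token in sorted(probs, key=probs.get, reverse=True)[:top_k]:
            child = prefix if token == EOS else prefix + (token,)
            patch = greedy_complete(model, child, max_len)
            if patch not in cache:
                cache[patch] = run_tests(patch, tests)  # semantic feedback
                evaluations += 1
            score = cache[patch]
            if score > best_score:
                best_patch, best_score = patch, score
            if score == 1.0:
                return patch, score, evaluations  # all tests pass
            if token != EOS:
                heapq.heappush(frontier, (-score, next(tie), child))
    return best_patch, best_score, evaluations


if __name__ == "__main__":
    print("greedy only:", greedy_complete(toy_model, (), max_len=4))
    print("search:", semantic_best_first_search(toy_model, TESTS))
```

Plain greedy decoding commits to `a - b`, which passes one of the three hypothetical tests; the search then expands the top two operators under prefix `a` and finds `a + b` after three test runs. A realistic system would also need to weigh model likelihood, compile candidate patches, and bound test-execution cost, none of which this toy models.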

Paper Structure

This paper contains 45 sections, 2 figures, 7 tables, and 1 algorithm.

Figures (2)

  • Figure 1: Memory usage and effectiveness of LLM-based APR techniques across beam sizes on an NVIDIA A100 (80GB) using full-precision (FP) and quantized (Q) models. BS and SeqBS denote standard and sequential beam search.
  • Figure 2: Overview of FLAMES