Summary-Mediated Repair: Can LLMs use code summarisation as a tool for program repair?

Lukas Twist

TL;DR

The paper addresses subtle, implementation-level bugs in code generated by large language models (LLMs) and proposes summary-mediated repair, a prompt-only pipeline that uses natural-language code summaries as an intermediate artefact to guide repair. The method operates in two stages: first, summarise the code to capture its intended behaviour; then generate repaired code conditioned on that summary, without relying on formal specifications. Evaluations across eight LLMs on two function-level benchmarks (HumanEvalPack and MBPP) show that error-aware summaries yield the largest gains, repairing up to 65% of unseen errors, although overall improvements are modest and model-dependent. The work positions code summaries as a cheap, interpretable artefact for augmenting program-repair pipelines, and publicly releases its code and results to support further research in this direction.

Abstract

Large Language Models (LLMs) often produce code with subtle implementation-level bugs despite strong benchmark performance. These errors are hard for LLMs to spot and can have large behavioural effects; yet when asked to summarise code, LLMs can frequently surface high-level intent and sometimes overlook this low-level noise. Motivated by this, we propose summary-mediated repair, a prompt-only pipeline for program repair that leverages natural-language code summarisation as an explicit intermediate step, extending previous work that has already shown code summarisation to be a useful intermediary for downstream tasks. We evaluate our method across eight production-grade LLMs on two function-level benchmarks (HumanEvalPack and MBPP), comparing several summary styles against a direct repair baseline. Error-aware diagnostic summaries consistently yield the largest gains, repairing up to 65% of unseen errors (an average of 5% more than the baseline), though overall improvements are modest and LLM-dependent. Our results position summaries as a cheap, human-interpretable diagnostic artefact that can be integrated into program-repair pipelines rather than a stand-alone fix-all.
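
The two-stage idea described above (summarise first, then repair conditioned on the summary) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the function name `summary_mediated_repair`, and the `llm` callable (any function mapping a prompt string to a completion string) are all hypothetical, and passing the original code alongside the summary in the second stage is an assumption rather than something the abstract states.

```python
from typing import Callable, Tuple

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
SUMMARY_PROMPT = (
    "Summarise what the following function is intended to do, "
    "and point out anything that looks like a bug.\n\n{code}"
)
REPAIR_PROMPT = (
    "Below is a function and a summary of its intended behaviour.\n\n"
    "Summary:\n{summary}\n\n"
    "Code:\n{code}\n\n"
    "Return a corrected version of the function."
)


def summary_mediated_repair(code: str, llm: Callable[[str], str]) -> Tuple[str, str]:
    """Run the two stages in order and return (summary, repaired_code).

    `llm` is an assumed interface: a callable that takes a prompt and
    returns the model's text response.
    """
    # Stage 1: produce a natural-language summary as the intermediate artefact.
    summary = llm(SUMMARY_PROMPT.format(code=code))
    # Stage 2: generate the repair conditioned on that summary.
    # Assumption: the buggy code is included in the repair prompt as well.
    repaired = llm(REPAIR_PROMPT.format(summary=summary, code=code))
    return summary, repaired
```

Returning the summary together with the repaired code keeps the intermediate artefact available for human inspection, which matches the abstract's framing of summaries as an interpretable diagnostic.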

Paper Structure

This paper contains 27 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Summary-Mediated Repair. Our proposed pipeline for a prompt-only APR method, where a code summary is generated as an intermediate artefact. Full details are given in the method section.