Summary-Mediated Repair: Can LLMs use code summarisation as a tool for program repair?
Lukas Twist
TL;DR
The paper addresses subtle, implementation-level bugs in code generated by large language models (LLMs) and proposes summary-mediated repair, a prompt-only pipeline that uses natural-language code summaries as an intermediate artefact to guide repair. The method operates in two stages: it first summarises the code to capture its intended behaviour, then generates repaired code conditioned on that summary, without relying on formal specifications. Evaluations across eight LLMs on two function-level benchmarks (HumanEvalPack and MBPP) show that error-aware summaries yield the largest gains, repairing up to 65% of unseen errors, although overall improvements are modest and model-dependent. The work positions code summaries as a cheap, interpretable artefact for augmenting program-repair pipelines, and publicly releases its code and results to support further research in this direction.
Abstract
Large Language Models (LLMs) often produce code with subtle implementation-level bugs despite strong benchmark performance. These errors are hard for LLMs to spot and can have large behavioural effects; yet when asked to summarise code, LLMs can frequently surface high-level intent and sometimes overlook this low-level noise. Motivated by this, we propose summary-mediated repair, a prompt-only pipeline for program repair that leverages natural-language code summarisation as an explicit intermediate step, extending previous work that has already shown code summarisation to be a useful intermediary for downstream tasks. We evaluate our method across eight production-grade LLMs on two function-level benchmarks (HumanEvalPack and MBPP), comparing several summary styles against a direct repair baseline. Error-aware diagnostic summaries consistently yield the largest gains, repairing up to 65% of unseen errors (on average 5% more than the baseline), though overall improvements are modest and LLM-dependent. Our results position summaries as a cheap, human-interpretable diagnostic artefact that can be integrated into program-repair pipelines rather than a stand-alone fix-all.
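To make the idea of summarisation as an explicit intermediate step concrete, the following is a minimal Python sketch of a summary-then-repair pipeline alongside a direct repair baseline. It is an illustration under stated assumptions, not the released implementation: the `LLM` callable, the function names, and all prompt wording are hypothetical, and passing the original code to the repair stage alongside the summary is also an assumption.

```python
from typing import Callable, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


def build_summary_prompt(code: str, error_aware: bool) -> str:
    """Stage 1 prompt: ask for a natural-language summary of the code."""
    instruction = "Summarise what the following function is intended to do."
    if error_aware:
        # Illustrative "error-aware" variant; the wording is an assumption.
        instruction += " Also point out anything that looks like a bug."
    return f"{instruction}\n\n{code}"


def build_repair_prompt(code: str, summary: str) -> str:
    """Stage 2 prompt: ask for repaired code conditioned on the summary."""
    return (
        "Rewrite the function so that it matches this summary.\n\n"
        f"Summary:\n{summary}\n\n"
        f"Original code:\n{code}"
    )


def summary_mediated_repair(
    code: str, llm: LLM, error_aware: bool = True
) -> Tuple[str, str]:
    """Run both stages and return (summary, repaired_code)."""
    summary = llm(build_summary_prompt(code, error_aware))
    repaired = llm(build_repair_prompt(code, summary))
    return summary, repaired


def direct_repair(code: str, llm: LLM) -> str:
    """Baseline: a single repair prompt with no intermediate summary."""
    return llm(f"Fix any bugs in the following function.\n\n{code}")
```

Because the summary is returned alongside the repaired code, it can be inspected by a human or logged as a diagnostic artefact, which is the role the abstract assigns to it. Any chat or completion client can be wrapped as the `llm` argument.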
