Small Language Models Improve Giants by Rewriting Their Outputs

Giorgos Vernikos; Arthur Bražinskas; Jakub Adamek; Jonathan Mallinson; Aliaksei Severyn; Eric Malmi

TL;DR

The paper introduces LMCor, a compact corrector model that improves LLM outputs at inference time by merging and rewriting multiple LLM-generated candidates, without access to the LLM's weights. In experiments on grammatical error correction, data-to-text generation, abstractive summarization, and machine translation, an LMCor with as few as 250M parameters matches or surpasses task-specific fine-tuning on several tasks and is robust to prompt variations. Because the corrector is trained to combine diverse candidates, it works as a plug-and-play module across different LLMs, at the cost of some added latency. The work offers a practical, resource-efficient way to pair the strengths of large models with small, task-tuned components.

Abstract

Despite the impressive performance of large language models (LLMs), they often lag behind specialized models in various tasks. LLMs only use a fraction of the existing training data for in-context learning, while task-specific models harness the full dataset for fine-tuning. In this work, we tackle the problem of leveraging training data to improve the performance of LLMs without fine-tuning. Our approach directly targets LLM predictions without requiring access to their weights. We create a pool of candidates from the LLM through few-shot prompting and we employ a compact model, the LM-corrector (LMCor), specifically trained to merge these candidates to produce an enhanced output. Our experiments on four natural language generation tasks demonstrate that even a small LMCor model (250M) substantially improves the few-shot performance of LLMs (62B), matching and even outperforming standard fine-tuning. Furthermore, we illustrate the robustness of LMCor against different prompts, thereby minimizing the need for extensive prompt engineering. Finally, we show that LMCor can be seamlessly integrated with different LLMs at inference, serving as a plug-and-play module to improve their performance.
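The pipeline described in the abstract (sample several candidates from the LLM, then let a small corrector merge them) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_llm` and `corrector` are hypothetical callables standing in for an LLM API and a trained LMCor, and both the separator string and the choice to prepend the source text to the candidates are assumptions made for this example.

```python
from typing import Callable, List


def build_corrector_input(source: str, candidates: List[str],
                          sep: str = " [SEP] ") -> str:
    """Join the source and the LLM candidates into one input string.

    Assumption: the corrector reads the source followed by the candidates,
    separated by `sep`. The separator is a placeholder, not from the paper.
    """
    return sep.join([source] + candidates)


def correct(source: str,
            sample_llm: Callable[[str], str],
            corrector: Callable[[str], str],
            k: int = 5) -> str:
    """Sample k candidates from the LLM and let the corrector rewrite them."""
    # Only the LLM's text outputs are used; its weights are never touched.
    candidates = [sample_llm(source) for _ in range(k)]
    return corrector(build_corrector_input(source, candidates))


if __name__ == "__main__":
    # Stub components so the sketch runs without any model.
    def fake_llm(text: str) -> str:
        return text.replace("go", "goes")

    def fake_corrector(text: str) -> str:
        # A real corrector would generate text; this stub returns the last candidate.
        return text.split(" [SEP] ")[-1]

    print(correct("She go home.", fake_llm, fake_corrector, k=3))
```

Because the corrector only consumes text, the same wrapper could be pointed at a different LLM by swapping `sample_llm`, which is the plug-and-play property the abstract refers to.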

Paper Structure (25 sections, 2 equations, 8 figures, 9 tables)

Introduction
Correcting the Outputs of LLMs
Headroom analysis
Generating the candidates
Correcting the candidates
Experiments & Results
Datasets and Models
Grammatical Error Correction
Data-to-text
Summarization
Machine Translation
Robustness Analysis
Different prompts
Different LLMs
Task-specific models
...and 10 more sections

Figures (8)

Figure 1: An illustration of our approach for grammatical error correction. We first prompt an LLM to generate multiple outputs via an API (dotted lines). Then we feed the generated candidates to the LM-corrector, a small model that is trained to rewrite them in order to generate the target sentence (solid lines).
Figure 2: Potential of ranking (oracle-rank) and combining (oracle-combine) sampled candidates (k=10) from PaLM models of different scales for GEC.
Figure 3: The effect of dataset size for standard fine-tuning and LMCor. Results are reported on GEC.
Figure 4: The effect of scaling for LMCor and fine-tuning. Results are reported on GEC.
Figure 5: LLM prompt for GEC.
...and 3 more figures
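The oracle-rank upper bound named in the Figure 2 caption can be illustrated with a short sketch: given the reference, pick the sampled candidate that scores highest against it. The token-level F1 used here is only a stand-in metric chosen for this example; the paper's actual GEC metric and its oracle-combine procedure are not reproduced.

```python
from collections import Counter
from typing import Callable, List


def token_f1(hypothesis: str, reference: str) -> float:
    """Whitespace-token F1 between two strings (placeholder metric)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return float(hyp == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def oracle_rank(candidates: List[str], reference: str,
                metric: Callable[[str, str], float] = token_f1) -> str:
    """Return the candidate with the highest metric score against the reference."""
    return max(candidates, key=lambda c: metric(c, reference))


if __name__ == "__main__":
    cands = ["she go home", "she goes home", "her goes"]
    print(oracle_rank(cands, "she goes home"))
```

An oracle like this needs the reference, so it measures headroom rather than serving as a usable system; a trained corrector has to approximate it from the candidates alone.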
