Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models

Zhanhui Zhou; Zhixuan Liu; Jie Liu; Zhichen Dong; Chao Yang; Yu Qiao

TL;DR

Weak-to-strong search reframes LLM alignment as a test-time, search-based decoding problem: the log-probability difference between a small tuned LM and its untuned counterpart steers a frozen large LM. From this log-ratio the method derives a dense per-token reward and a value function, which enable a practical Chunk-level Beam Search (CBS) that balances reward maximization against a KL constraint and works with both white-box and black-box models. Empirically, CBS improves controlled-sentiment generation, summarization, and instruction following; on AlpacaEval 2.0 it raises the length-controlled win rates of large models against gpt-4-turbo. The approach serves as a compute-efficient model up-scaling strategy and as an instance of weak-to-strong generalization, in which weak guidance improves substantially larger models without additional training. In practice, this allows aligned LLMs to be deployed at scale using readily available small-model guidance, without retraining the large model.
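As a reading aid, the log-ratio mentioned in this summary can be written out token by token. The notation here is an assumption, not taken from this page: $\pi^{*}$ is the small tuned model, $\pi_{\text{ref}}$ the small untuned model, $x$ the prompt, and $y = (y_1, \dots, y_T)$ a response. The chain rule of probability gives

$$
\log \frac{\pi^{*}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}
= \sum_{t=1}^{T} \Big[ \log \pi^{*}(y_t \mid x, y_{<t}) - \log \pi_{\text{ref}}(y_t \mid x, y_{<t}) \Big].
$$

Each bracketed term can be read as a per-token reward, and the partial sum over the first $t$ tokens as a score for an incomplete response. This is a sketch of the idea only; any scaling constants or normalization terms used in the paper are omitted.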

Abstract

Large language models are usually fine-tuned to align with human preferences. However, fine-tuning a large language model can be challenging. In this work, we introduce $\textit{weak-to-strong search}$, framing the alignment of a large language model as a test-time greedy search to maximize the log-probability difference between small tuned and untuned models while sampling from the frozen large model. This method serves both as (1) a compute-efficient model up-scaling strategy that avoids directly tuning the large model and as (2) an instance of weak-to-strong generalization that enhances a strong model with weak test-time guidance. Empirically, we demonstrate the flexibility of weak-to-strong search across different tasks. In controlled-sentiment generation and summarization, we use tuned and untuned $\texttt{gpt2}$s to improve the alignment of large models without additional training. Crucially, in a more difficult instruction-following benchmark, AlpacaEval 2.0, we show that reusing off-the-shelf small models (e.g., $\texttt{zephyr-7b-beta}$ and its untuned version) can improve the length-controlled win rates of both white-box and black-box large models against $\texttt{gpt-4-turbo}$ (e.g., $34.4\% \rightarrow 37.9\%$ for $\texttt{Llama-3-70B-Instruct}$ and $16.0\% \rightarrow 20.1\%$ for $\texttt{gpt-3.5-turbo-instruct}$), despite the small models' low win rates $\approx 10.0\%$.
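The search described in the abstract can be illustrated with a toy sketch: sample continuations from the large model, score them with the small models' log-probability difference, and keep the best. The sketch below is a simplified beam search over chunks under stated assumptions, not the authors' implementation. All names (`make_toy_lm`, `sample_chunk`, `log_ratio`, `chunk_level_beam_search`) are hypothetical, the "models" are context-independent toy distributions over a four-token vocabulary, and `W`, `K`, `L` stand for beam width, successors per hypothesis, and chunk length as in the figure captions.

```python
import math
import random

EOS = "<eos>"

def make_toy_lm(weights):
    """Hypothetical toy LM: the same next-token distribution for every context."""
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    return lambda context: probs

def is_finished(response, max_len):
    """A response is finished once it ends with EOS or reaches max_len tokens."""
    return len(response) >= max_len or (bool(response) and response[-1] == EOS)

def sample_chunk(base_lm, prompt, response, chunk_len, max_len, rng):
    """Extend `response` by up to `chunk_len` tokens sampled from the base (large) model."""
    out = list(response)
    for _ in range(chunk_len):
        if is_finished(out, max_len):
            break
        probs = base_lm(prompt + out)
        tokens = list(probs)
        out.append(rng.choices(tokens, weights=[probs[t] for t in tokens])[0])
    return out

def log_ratio(tuned_lm, untuned_lm, prompt, response):
    """Sum over response tokens of log p_tuned - log p_untuned (the guidance score)."""
    score = 0.0
    for t, tok in enumerate(response):
        context = prompt + response[:t]
        score += math.log(tuned_lm(context)[tok]) - math.log(untuned_lm(context)[tok])
    return score

def chunk_level_beam_search(base_lm, tuned_lm, untuned_lm, prompt,
                            W=2, K=2, L=3, max_len=12, seed=0):
    """Keep W hypotheses; expand each with K sampled chunks of length L; keep the top W."""
    rng = random.Random(seed)
    score = lambda resp: log_ratio(tuned_lm, untuned_lm, prompt, resp)
    beam = [[] for _ in range(W)]
    while not all(is_finished(h, max_len) for h in beam):
        candidates = [sample_chunk(base_lm, prompt, h, L, max_len, rng)
                      for h in beam for _ in range(K)]
        beam = sorted(candidates, key=score, reverse=True)[:W]
    return max(beam, key=score)

# Toy setup (assumed): the base and untuned models are uniform; the tuned model prefers "good".
base = make_toy_lm({"good": 1, "bad": 1, "ok": 1, EOS: 1})
untuned = make_toy_lm({"good": 1, "bad": 1, "ok": 1, EOS: 1})
tuned = make_toy_lm({"good": 14, "bad": 1, "ok": 3, EOS: 2})

prompt = ["review", ":"]
response = chunk_level_beam_search(base, tuned, untuned, prompt, W=2, K=2, L=3)
print(response, log_ratio(tuned, untuned, prompt, response))
```

In this sketch the large model is only ever sampled from, never scored, which is why the same loop could in principle wrap a black-box model that exposes text completions. A real setup would replace the toy callables with actual model calls and would need to handle tokenizer differences between the large and small models.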

Paper Structure (43 sections, 13 equations, 9 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Aligning Language Models with Human Preferences
Duality between Language Models and Reward Functions
Weak-to-Strong Search
Language Models as Both Reward and Value Functions
Language models as a dense reward function.
Cumulative reward under language models as a value function (Rafailov et al., 2024).
Chunk-level Beam Search (CBS)
Application: Model Up-Scaling and Weak-to-Strong Generalization
Experiments
Baselines.
Controlled-Sentiment Generation & Summarization
Setup.
...and 28 more sections

Figures (9)

Figure 1: Weak-to-strong search enhances the alignment of large models through test-time guidance from small models (dashed lines). This method is applicable to white-box models that use the same or different vocabularies as the small models, as well as to black-box models. We present the results for the instruction-tuned models from each family (e.g., Llama2-7B denotes Llama-2-7b-chat).
Figure 2: Illustration of Chunk-level Beam Search with $W,K=2,2$.
Figure 3: The gold reward achieved for different large pre-trained models under the gpt2 guidance. We show the mean reward ($\pm$ standard deviations) across three random seeds. EFT ($\beta^*$) denotes the best EFT results among $\beta \in \{1/4, 1/2, 1, 2, 4\}$; Weak-to-strong search $(4,4,5)$ denotes CBS with $W,K,L=4,4,5$; BoN ($16$) denotes BoN with $N=16$.
Figure 4: W, K ablations for CBS ($\mathbf{L=5}$). We show the mean rewards across three random seeds. With the same computation budget (i.e., the same $W K$), the optimal hyperparameters differ by task.
Figure 5: L ablations for CBS (W, K = 4, 4). We show the mean rewards ($\pm$ standard deviations) across three random seeds.
...and 4 more figures
