Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
Zhanhui Zhou, Zhixuan Liu, Jie Liu, Zhichen Dong, Chao Yang, Yu Qiao
TL;DR
Weak-to-strong search reframes LLM alignment as a test-time, search-based decoding problem: the log-probability difference between a small tuned LM and its untuned counterpart steers a frozen large LM. From this log-ratio the method derives a dense per-token reward and a value function, which make a practical Chunk-level Beam Search (CBS) possible. CBS balances reward maximization against KL constraints and works with both white-box and black-box large models. Empirically, CBS improves controlled-sentiment generation, summarization, and instruction following, including higher length-controlled win rates against GPT-4-turbo on AlpacaEval 2.0. The approach offers a compute-efficient path to model up-scaling and demonstrates weak-to-strong generalization, in which weak guidance improves substantially larger models without additional training. In practice, aligned LLMs can be deployed at scale using readily available small-model guidance, without retraining the large system.
Abstract
Large language models are usually fine-tuned to align with human preferences. However, fine-tuning a large language model can be challenging. In this work, we introduce $\textit{weak-to-strong search}$, framing the alignment of a large language model as a test-time greedy search to maximize the log-probability difference between small tuned and untuned models while sampling from the frozen large model. This method serves both as (1) a compute-efficient model up-scaling strategy that avoids directly tuning the large model and as (2) an instance of weak-to-strong generalization that enhances a strong model with weak test-time guidance. Empirically, we demonstrate the flexibility of weak-to-strong search across different tasks. In controlled-sentiment generation and summarization, we use tuned and untuned $\texttt{gpt2}$s to improve the alignment of large models without additional training. Crucially, in a more difficult instruction-following benchmark, AlpacaEval 2.0, we show that reusing off-the-shelf small models (e.g., $\texttt{zephyr-7b-beta}$ and its untuned version) can improve the length-controlled win rates of both white-box and black-box large models against $\texttt{gpt-4-turbo}$ (e.g., $34.4\% \rightarrow 37.9\%$ for $\texttt{Llama-3-70B-Instruct}$ and $16.0\% \rightarrow 20.1\%$ for $\texttt{gpt-3.5-turbo-instruct}$), despite the small models' low win rates $\approx 10.0\%$.
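To make the search procedure concrete, the following is a minimal sketch of a chunk-level beam search guided by the tuned/untuned log-probability difference. It is an illustration under stated assumptions, not the authors' implementation: the three toy models (`base_probs`, `tuned_logprob`, `untuned_logprob`), the vocabulary, and the hyperparameter names (`beam_width`, `successors`, `chunk_len`) are all hypothetical stand-ins chosen so the example runs without any external library.

```python
import math
import random

VOCAB = ["good", "bad", "ok"]  # hypothetical toy vocabulary


def base_probs(prefix):
    # Toy stand-in for the frozen large model: uniform next-token distribution.
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


def tuned_logprob(prefix, tok):
    # Toy stand-in for the small tuned model: prefers "good".
    probs = {"good": 0.7, "ok": 0.2, "bad": 0.1}
    return math.log(probs[tok])


def untuned_logprob(prefix, tok):
    # Toy stand-in for the small untuned model: uniform.
    return math.log(1.0 / len(VOCAB))


def sample_chunk(prefix, chunk_len, rng):
    # Sample chunk_len tokens autoregressively from the large model.
    chunk = []
    for _ in range(chunk_len):
        probs = base_probs(prefix + chunk)
        toks = list(probs)
        weights = [probs[t] for t in toks]
        chunk.append(rng.choices(toks, weights=weights, k=1)[0])
    return chunk


def log_ratio(prefix, chunk):
    # Sum of per-token log p_tuned - log p_untuned over the chunk.
    total = 0.0
    for i, tok in enumerate(chunk):
        ctx = prefix + chunk[:i]
        total += tuned_logprob(ctx, tok) - untuned_logprob(ctx, tok)
    return total


def chunk_level_beam_search(prompt, beam_width=2, successors=3,
                            chunk_len=2, max_new_tokens=6, seed=0):
    rng = random.Random(seed)
    beams = [(0.0, list(prompt))]  # (cumulative log-ratio, token sequence)
    for _ in range(max_new_tokens // chunk_len):
        candidates = []
        for score, seq in beams:
            # Expand each beam with several chunks sampled from the large model.
            for _ in range(successors):
                chunk = sample_chunk(seq, chunk_len, rng)
                candidates.append((score + log_ratio(seq, chunk), seq + chunk))
        # Keep the candidates the small models prefer most.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


if __name__ == "__main__":
    best_score, best_seq = chunk_level_beam_search(["review:"])
    print(best_score, best_seq)
```

Note that the large model is only ever sampled from, never scored or updated, which is why a scheme of this shape can in principle be applied to black-box models. Setting `beam_width=1` and `successors=1` in this sketch reduces it to plain sampling from the large model.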
