Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation
Seamus Somerstep, Vinod Raman, Unique Subedi, Yuekai Sun
TL;DR
The paper gives a formal, information-theoretic comparison of Best-of-N (BoN) and supervised fine-tuning (SFT) for autoregressive bit-string generation, covering realizable versus agnostic learning and two inference-time regimes. In the realizable setting, SFT converges to near-perfect reward at a rate nearly independent of the response length $T$, whereas the convergence rate of BoN carries a term linear in $T$; SFT can therefore be preferable when the target function lies within the hypothesis class. In agnostic settings, BoN can maintain positive reward under broader conditions, whereas SFT may fail depending on the mismatch between training-time teacher forcing and test-time autoregressive generation. The bounds are stated in terms of VC dimensions and a coverage constant $\alpha$, and they clarify when inference-time BoN or training-time SFT is the better choice for adapting large language models to a task, with implications for design choices in alignment and instruction-following systems.
Abstract
Using the bit string generation problem as a case study, we theoretically compare two standard methods for adapting large language models to new tasks. The first, supervised fine-tuning (SFT), trains a new next-token predictor on good generations. The second, Best-of-N (BoN), trains a reward model to select good responses from a collection generated by an unaltered base model. If the learning setting is realizable, we find that SFT outperforms BoN through a better dependence on the response length in its rate of convergence. If realizability fails, then depending on the failure mode, BoN can enjoy either a better rate of convergence in $n$ or a rate of convergence with better dependence on the response length.
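To make the Best-of-N procedure described in the abstract concrete, here is a minimal sketch for bit strings. It is an illustration under stated assumptions, not the paper's construction: the base model is replaced by fair coin flips, the learned reward model is replaced by a fixed hypothetical function `parity_reward`, and the names `sample_base`, `select_best`, and `best_of_n` are invented for this example.

```python
import random
from typing import Callable, List

BitString = List[int]
Reward = Callable[[BitString, BitString], float]


def sample_base(prompt: BitString, T: int, rng: random.Random) -> BitString:
    # Toy base model (assumption): ignores the prompt and emits T fair bits.
    return [rng.randint(0, 1) for _ in range(T)]


def select_best(prompt: BitString, candidates: List[BitString],
                reward: Reward) -> BitString:
    # Return the highest-scoring candidate; ties go to the earliest one.
    return max(candidates, key=lambda y: reward(prompt, y))


def best_of_n(prompt: BitString, N: int, T: int, reward: Reward,
              rng: random.Random) -> BitString:
    # Draw N responses from the unaltered base model, then let the
    # reward model pick one. The base model itself is never retrained.
    candidates = [sample_base(prompt, T, rng) for _ in range(N)]
    return select_best(prompt, candidates, reward)


def parity_reward(prompt: BitString, response: BitString) -> float:
    # Hypothetical stand-in for a learned reward model: reward 1.0 when
    # the parity of the response equals the first prompt bit.
    return 1.0 if sum(response) % 2 == prompt[0] else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    response = best_of_n(prompt=[1], N=8, T=5, reward=parity_reward, rng=rng)
    print(response, parity_reward([1], response))
```

In this sketch the only learned component would be the reward function; SFT, by contrast, would replace `sample_base` with a newly trained next-token predictor and generate a single response autoregressively.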
