Making, not Taking, the Best of N

Ammar Khairi, Daniel D'souza, Marzieh Fadaee, Julia Kreutzer

TL;DR

FusioN reframes generation aggregation from selecting a single best sample to synthesizing across multiple candidates, enabling a collaborative fusion of strengths. Using a fusor LLM, FusioN integrates diverse signals to produce higher-quality outputs than Best-of-N in both test-time scaling and synthetic data generation across 11 languages and multiple tasks. The approach delivers consistent gains, is robust to weaker teacher pools, and yields meaningful downstream improvements after fine-tuning, signaling a practical shift toward polylithic evaluation and deployment of LLM generations. The work broadens the understanding of interaction among multiple generations, offering a simple, training-free fusion mechanism with broad applicability and impact for real-world multilingual NLP systems.

Abstract

Obtaining high-quality generations in modern LLMs has largely been framed as a selection problem: identifying a single winning generation from a diverse pool of N samples, the Best-of-N (BoN). Yet, this approach is inherently zero-sum, discarding diverse and potentially useful information from the pool. Instead, we explore a collaborative setup, where all candidates can potentially contribute to the final winning generation. To this end, we propose Fusion-of-N (FusioN): a method that uses a general LLM judge to synthesize the most informative elements of each sample into a single final answer. We compare FusioN to BoN in two settings: (i) test-time scaling, where we sample and aggregate from a single model at test time, and (ii) synthetic data generation, where we fuse samples from a pool of diverse teachers to improve a student model. We extensively benchmark both setups across 11 languages, 3 diverse tasks, and varying model scales. Across the bench, FusioN consistently outperforms BoN, showing versatility and robustness both in test-time scaling and in downstream gains from synthetic data generation. We also perform extensive analysis on FusioN, where it shows surprising strengths and robustness under challenging settings. These results show that we should shift how we think about evaluating and utilizing LLM generations, from a monolithic measure of quality to embracing their polylithic nature. This shift allows us to integrate diverse strengths, unlock latent potential, and achieve improvements that were previously inaccessible through selection alone.
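
To make the contrast in the abstract concrete, the sketch below places selection (BoN) next to synthesis (FusioN). It is a minimal illustration, not the authors' implementation: the `score` and `fusor` callables stand in for a reward model or judge and for a fusor LLM, and the prompt template in `build_fusion_prompt` is a hypothetical placeholder, since the paper's actual fusor prompt is not given in this section.

```python
from typing import Callable, List


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Selection: keep the single highest-scoring candidate, discard the rest."""
    return max(candidates, key=score)


def build_fusion_prompt(prompt: str, candidates: List[str]) -> str:
    """Assemble the fusor input. The wording is an assumption for illustration."""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    return (
        f"Prompt:\n{prompt}\n\n"
        f"Candidate responses:\n{numbered}\n\n"
        "Write one final response that combines the strongest elements "
        "of the candidates."
    )


def fusion_of_n(
    prompt: str, candidates: List[str], fusor: Callable[[str], str]
) -> str:
    """Synthesis: every candidate is shown to the fusor, which writes a new answer."""
    return fusor(build_fusion_prompt(prompt, candidates))


if __name__ == "__main__":
    samples = ["Paris.", "The capital of France is Paris.", "It is Paris, on the Seine."]
    # Toy stand-ins: length as a "score", and a fusor that only echoes its input.
    print(best_of_n(samples, score=len))
    print(fusion_of_n("What is the capital of France?", samples, fusor=lambda p: p))
```

The structural difference is in the return value: `best_of_n` can only return one of the existing samples, while `fusion_of_n` returns whatever the fusor writes after seeing all of them. In practice `fusor` would wrap a call to an LLM.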

Paper Structure

This paper contains 22 sections, 2 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: FusioN principle: Multiple generations (here $N=4$, from one or multiple models) get fused into one final generation combining the strengths of each individual generation.
  • Figure 2: Test-time scaling with $N=5$: FusioN raises win rates against Gemini2.5-Pro on Arena across languages. It largely outperforms BoN with the same set of samples, for both Aya Expanse 8B and Command A models. Gray markers indicate greedy baseline performance.
  • Figure 3: FusioN vs BoN vs Oracle (the highest-scoring sample according to the ground truth) in Translation; error bars show standard error. Bars with a bold border (German, Russian and Chinese) mark cases where FusioN outperforms the Oracle.
  • Figure 4: Downstream evaluation on multilingual factual reasoning on the GeoFactX test set. FusioN outperforms BoN notably in both reasoning quality and answer correctness in 4/5 languages.
  • Figure 5: Size of the fusor matters: Small LLMs might serve well as scalar judges in BoN, but generative fusion capabilities get unlocked at larger scale, here measured in win rates on Arena, averaged across languages. Shaded areas represent standard error.
  • ...and 13 more figures