Test-Time Scaling with Repeated Sampling Improves Multilingual Text Generation

Ashim Gupta, Vivek Srikumar

TL;DR

This paper investigates whether test-time scaling through repeated sampling improves multilingual text generation. It compares perplexity-based and reward-based verifiers across two multilingual benchmarks (Aya and m-ArenaHard) and multiple open-weight LLMs, showing robust gains in multilingual generation, with perplexity-based verifiers excelling on open-ended prompts and reward-based verifiers better supporting reasoning tasks. The Gemma-2B reward model emerges as the strongest verifier, delivering substantial gains even with relatively small models, while verifier choice proves critical and depends on both task and language. The work demonstrates the practical utility of repeated sampling for multilingual generation and highlights the need for adaptive, multilingual reward modeling to further enhance performance across diverse languages and tasks.

Abstract

Inference-time scaling via repeated sampling has shown promise in reasoning tasks, but its effectiveness in multilingual generation remains underexplored. We evaluate this approach using perplexity- and reward-based verifiers on two multilingual benchmarks: the Aya Evaluation Suite and m-ArenaHard. Our results show consistent quality improvements, with gains exceeding 35% in some cases. While perplexity-based scoring is effective for open-ended prompts, only reward-based verifiers improve performance on tasks requiring reasoning (e.g., math, code). Our results demonstrate the broader utility of repeated sampling for multilingual text generation and underscore the importance of selecting the right verifier for the task.
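The selection step behind repeated sampling can be illustrated with a minimal sketch: draw several candidate responses, score each with a verifier, and keep the highest-scoring one. The code below is an illustration under stated assumptions, not the authors' implementation. The names `Candidate`, `perplexity`, `best_of_n`, and `ppl_score` are hypothetical, the candidates and their token log-probabilities are made-up toy values, and perplexity is computed in the standard way as the exponential of the negative mean token log-probability.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: (generated text, per-token log-probabilities).
Candidate = Tuple[str, Sequence[float]]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def best_of_n(candidates: List[Candidate],
              score: Callable[[Candidate], float]) -> str:
    """Return the text of the candidate with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)[0]


def ppl_score(candidate: Candidate) -> float:
    """Perplexity-based verifier (assumption: lower perplexity is better)."""
    return -perplexity(candidate[1])


# Toy data for illustration only; not taken from the paper.
samples = [
    ("fluent answer", [-0.1, -0.2, -0.1]),
    ("less fluent answer", [-1.5, -2.0, -0.9]),
]
chosen = best_of_n(samples, ppl_score)
```

A reward-based verifier would plug into the same `best_of_n` call by passing a different `score` function, for example one that wraps a reward model's scalar output for the prompt and response.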

Paper Structure

This paper contains 21 sections, 8 figures.

Figures (8)

  • Figure 1: Repeated sampling procedure using a verifier to pick the final answer.
  • Figure 2: Test-time scaling with repeated sampling for Aya Evaluation Suite. The plots show the difference between win and loss rates (delta). We see that all verifiers, both perplexity-based (PPL) and reward-based (RM), can improve generation quality.
  • Figure 3: Test-time scaling with repeated sampling for m-ArenaHard. The plots show the difference between win and loss rates (delta). Only reward-model-based verifiers improve generation quality.
  • Figure 4: Training-time compute vs Test-time compute.
  • Figure 5: Impact of baseline used for win rate calculation. We use gemini-2.0-flash as the judge model and evaluate stability with a reward model as the verifier (URM-LLaMa-3.1-8B).
  • ...and 3 more figures
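
The captions for Figures 2 and 3 report a delta, defined as the difference between win and loss rates. A minimal sketch of that arithmetic follows. The function name `win_loss_delta`, the outcome labels, and the choice to count ties in the denominator and to express the result in percentage points are assumptions made for illustration; the captions do not specify these details.

```python
from collections import Counter
from typing import Iterable


def win_loss_delta(outcomes: Iterable[str]) -> float:
    """Win rate minus loss rate, in percentage points.

    Assumption: each outcome is "win", "loss", or "tie", and ties
    count toward the total number of comparisons.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one comparison")
    return 100.0 * (counts["win"] - counts["loss"]) / total


# Toy judge decisions for illustration only; not results from the paper.
delta = win_loss_delta(["win", "win", "loss", "tie"])
```

A positive delta means the sampled-and-selected response beat the baseline more often than it lost to it; a negative delta means the opposite.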