Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model

Fahim Faisal, Kaiqiang Song, Song Wang, Simin Ma, Shujian Liu, Haoyun Deng, Sathish Reddy Indurthi

TL;DR

PB-RLSVR addresses the multilingual reasoning gap by using an English expert model as a verifiable pivot to supervise reasoning in target languages, without requiring target-language annotations. It introduces a hybrid semantic reward that combines COMET-based answer precision with embedding- or translation-based reasoning coherence, and integrates this reward into a GRPO-based policy optimization loop. Empirical results on Llama-3.1-8B-Instruct and Qwen3-32B show substantial gains over SFT and PPO baselines, a notable reduction of the gap between English and non-English languages, and strong zero-shot transfer to unseen languages. The approach offers a scalable path to multilingual reasoning systems, with potential extensions to other modalities and to curriculum-based reductions of pivot reliance.

Abstract

While reinforcement learning has advanced the reasoning abilities of Large Language Models (LLMs), these gains are largely confined to English, creating a significant performance disparity across languages. To address this, we introduce Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards (PB-RLSVR), a novel framework that enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. Our approach employs a high-performing English LLM as a "pivot" model to generate reference responses for reasoning tasks. A multilingual model is then rewarded based on the semantic equivalence of its responses to the English reference, effectively transferring the pivot model's reasoning capabilities across languages. We investigate several cross-lingual semantic reward functions, including those based on embeddings and machine translation. Extensive experiments on a suite of multilingual reasoning benchmarks show that our method significantly narrows the performance gap between English and other languages, substantially outperforming traditional PPO baselines. Specifically, our PB-RLSVR framework improves the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively, demonstrating a powerful and data-efficient approach to building truly multilingual reasoning agents.
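As an illustration of the reward design summarized above, the following is a minimal sketch and not the authors' implementation. It assumes the answer-level score (for example, a COMET score scaled to [0, 1]) and the sentence embeddings are computed elsewhere and passed in as plain numbers and vectors. The function names, the mixing weight `alpha`, and the use of the population standard deviation are hypothetical choices made for this example. The last function shows the group-relative normalization commonly used in GRPO, where each sampled response's reward is standardized against the other responses to the same prompt.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_reward(answer_score, reasoning_emb, reference_emb, alpha=0.5):
    """Mix an answer-level score with a reasoning-coherence score.

    answer_score:  assumed to be in [0, 1], e.g. a scaled COMET score
                   between the response's answer and the English reference.
    reasoning_emb: embedding of the target-language reasoning.
    reference_emb: embedding of the English pivot reasoning.
    alpha:         hypothetical mixing weight (not taken from the paper).
    """
    # Clip negative similarities so the coherence term stays in [0, 1].
    coherence = max(0.0, cosine(reasoning_emb, reference_emb))
    return alpha * answer_score + (1.0 - alpha) * coherence


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Toy group of three sampled responses for one prompt.
    reference = [1.0, 0.0]
    rewards = [
        hybrid_reward(1.0, [1.0, 0.0], reference),  # correct answer, aligned reasoning
        hybrid_reward(0.0, [0.0, 1.0], reference),  # wrong answer, unrelated reasoning
        hybrid_reward(1.0, [0.0, 1.0], reference),  # correct answer, unrelated reasoning
    ]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch, a response whose answer and reasoning both match the English reference receives the highest reward in its group and therefore a positive advantage, while responses that diverge from the reference are pushed down relative to the group mean.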

Paper Structure

This paper contains 35 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Performance of Llama-3.1-8B-Instruct and Qwen3-32B models across languages. On MGSM, Llama-3.1-8B-Instruct accuracy declines from 82.3% in English to 68% in Chinese. On MMLU-ProX, Qwen3-32B scores drop from 71.8% in English to 61.5% in Hindi. These results highlight a substantial multilingual reasoning gap.
  • Figure 2: An overview of our Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards (PB-RLSVR) framework. The policy model generates a response in a target language, which is evaluated against a trusted English-language reference to compute a reward signal for policy optimization.
  • Figure 3: Per-language performance on languages present in the training set. Our PB-RLSVR method (solid red line) significantly closes the performance gap between English and non-English languages compared to the baseline models (dashed blue line).
  • Figure 4: Five-shot performance on six out-of-distribution languages from MMLU-ProX. Our PB-RLSVR method (red) consistently improves reasoning performance over the respective baseline models (blue) for both the 8B and 32B scales, highlighting strong cross-lingual generalization.
  • Figure 5: Qualitative comparison on a mathematical reasoning task in Spanish. The baseline model makes a calculation error, while the PB-RLSVR model correctly follows the logical steps outlined in the English reference.
  • ...and 1 more figure