Reasoning Transfer for an Extremely Low-Resource and Endangered Language: Bridging Languages Through Sample-Efficient Language Understanding
Khanh-Tung Tran, Barry O'Sullivan, Hoang D. Nguyen
TL;DR
The paper tackles reasoning transfer to extremely low-resource languages by introducing English-Pivoted CoT Training, which constrains chain-of-thought traces to English while keeping inputs and final answers in the target language. By decomposing the objective into English-CoT generation and target-language final-answer generation, the method achieves robust cross-lingual reasoning with limited data, supported by the new Irish LC2024 benchmark. Key findings include substantial gains over baselines (up to 28.33 percentage points on Irish AIME2024 and 73.33% on LC2024) and evidence that separating language understanding from reasoning improves cross-lingual transfer. Ablations across low-, medium-, and high-resource languages show that the approach is effective in each setting, although how well it generalizes varies. The work offers a practical pathway to multilingual reasoning without extensive retraining per language and contributes a dataset for evaluating mathematical reasoning in Irish.
Abstract
Recent advances have enabled Large Language Models (LLMs) to tackle reasoning tasks by generating chain-of-thought (CoT) rationales, yet these gains have largely been confined to high-resource languages, leaving low-resource languages behind. In this work, we first investigate CoT techniques in extremely low-resource scenarios using existing prompting, model-editing, and fine-tuning approaches. We then introduce English-Pivoted CoT Training, which leverages the insight that LLMs internally operate in a latent space aligned toward the dominant language. Given input in a low-resource language, we perform supervised fine-tuning so that the model generates its CoT in English and outputs the final response in the target language. Across mathematical reasoning benchmarks, our approach outperforms the baselines, with up to 28.33% improvement in low-resource scenarios. Our analysis and additional experiments, including Mixed-Language CoT and Two-Stage Training, show that explicitly separating language understanding from reasoning enhances cross-lingual reasoning abilities. To facilitate future work, we also release LC2024, the first benchmark for mathematical tasks in Irish, an extremely low-resource and endangered language. Our results and resources highlight a practical pathway to multilingual reasoning that, despite data scarcity, does not require extensive retraining for every extremely low-resource language.
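To make the training format described in the abstract concrete, the following is a minimal sketch of how a single supervised fine-tuning example might be assembled: the prompt is the target-language question, and the completion is an English reasoning trace followed by the target-language answer. The `<think>`/`</think>` delimiters, the `PivotExample` and function names, and the toy Irish question are all assumptions made for illustration; none of them are taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PivotExample:
    """One training item (hypothetical schema, not the paper's)."""
    question_target: str  # problem statement in the target language, e.g. Irish
    cot_english: str      # reasoning trace written in English
    answer_target: str    # final answer phrased in the target language


def build_sft_pair(ex, think_open="<think>", think_close="</think>"):
    """Return a prompt/completion pair: English CoT first, then target answer.

    The delimiter strings are an assumed convention for marking the
    reasoning span; any consistent pair of markers would do.
    """
    completion = (
        f"{think_open}\n{ex.cot_english.strip()}\n{think_close}\n"
        f"{ex.answer_target.strip()}"
    )
    return {"prompt": ex.question_target.strip(), "completion": completion}


def split_completion(text, think_open="<think>", think_close="</think>"):
    """Split a generated completion into (english_cot, target_answer).

    If the delimiters are missing or out of order, the whole text is
    treated as the answer and the reasoning trace is empty.
    """
    start = text.find(think_open)
    end = text.find(think_close)
    if start == -1 or end == -1 or end < start:
        return "", text.strip()
    cot = text[start + len(think_open):end].strip()
    answer = text[end + len(think_close):].strip()
    return cot, answer


# Toy example, invented for illustration only.
example = PivotExample(
    question_target="Cad é 2 + 3?",
    cot_english="Add 2 and 3 to get 5.",
    answer_target="Is é 5 an freagra.",
)
pair = build_sft_pair(example)
cot, answer = split_completion(pair["completion"])
```

Keeping the reasoning span explicitly delimited means the same helper can be reused at evaluation time to score only the target-language answer, while the English trace remains available for inspection.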
