Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought

Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Amit Agarwal, Hyunwoo Ko, Chanuk Lim, Srikant Panda, Minhyuk Kim, Nikunj Drolia, Dasol Choi, Kyong-Ha Lee, Youngjae Yu

TL;DR

The paper tackles language-specific reasoning for mid-resource languages by introducing Language-Mixed CoT, which code-switches between English and the target language during the Think step to reduce translation artifacts. It builds Yi-Sang, a Korean reasoning dataset with 5.79M native prompts and 3.7M long reasoning traces, plus Yi-Sang-HQ with 260k high-yield examples, and trains KO-REAson models (4B–35B) across multiple families. The best model reaches the highest overall average score across nine benchmarks, $64.0 \pm 2.5$. The gains persist across model sizes and families and extend to cross-lingual and multi-modal tasks despite training primarily on Korean data, indicating a viable path for open, language-specific reasoning research. Overall, the work provides practical recipes for building reasoning models in mid-resource languages, including data collection pipelines, supervision strategies, and publicly released models and datasets.

Abstract

Recent frontier models employ long chain-of-thought reasoning to explore solution spaces in context and achieve stronger performance. While many works study distillation to build smaller yet capable models, most focus on English and little is known about language-specific reasoning. To bridge this gap, we first introduce **Language-Mixed CoT**, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artifacts. As a Korean case study, we curate **Yi-Sang**: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k high-yield subset. We train nine models (4B-35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc.). Our best model, **KO-REAson-35B**, achieves state-of-the-art performance, with the highest overall average score ($64.0 \pm 2.5$), ranking first on 5/9 benchmarks and second on the remainder. Smaller and mid-sized models also benefit substantially, with an average improvement of +18.6 points across the nine evaluated benchmarks. Ablations show that **Language-Mixed CoT** is more effective than monolingual CoT and also yields cross-lingual and multi-modal performance gains. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning. Data and model collection: https://huggingface.co/KOREAson.
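
To make the idea of a language-mixed reasoning trace concrete, the sketch below measures how much of a trace is written in Hangul versus Latin script and flags traces that contain a meaningful share of both. This is an illustrative heuristic only, not the filtering or generation procedure used in the paper: the function names `script_shares` and `is_language_mixed` and the `min_share` threshold of 0.05 are assumptions introduced here.

```python
def script_shares(text: str) -> dict[str, float]:
    """Return the share of Hangul vs. Latin letters among all counted letters."""
    hangul = 0
    latin = 0
    for ch in text:
        code = ord(ch)
        # Hangul syllables, Hangul Jamo, and Hangul Compatibility Jamo blocks.
        if (0xAC00 <= code <= 0xD7A3
                or 0x1100 <= code <= 0x11FF
                or 0x3130 <= code <= 0x318F):
            hangul += 1
        # ASCII letters stand in for English text in this sketch.
        elif ch.isascii() and ch.isalpha():
            latin += 1
    total = hangul + latin
    if total == 0:
        return {"hangul": 0.0, "latin": 0.0}
    return {"hangul": hangul / total, "latin": latin / total}


def is_language_mixed(text: str, min_share: float = 0.05) -> bool:
    """Heuristic: both scripts must reach `min_share` (a hypothetical threshold)."""
    shares = script_shares(text)
    return shares["hangul"] >= min_share and shares["latin"] >= min_share


if __name__ == "__main__":
    trace = "먼저 문제를 정리하자. Let x be the number of apples."
    print(script_shares(trace))
    print(is_language_mixed(trace))
```

A character-level ratio is deliberately crude: it ignores digits, punctuation, and math notation, and it cannot tell whether the switches happen at sensible points in the reasoning. It is only meant to show what "mixed" means at the surface level.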

Paper Structure

This paper contains 51 sections, 9 figures, 11 tables.

Figures (9)

  • Figure 1: (Left) Thinking styles. Red: monolingual CoT carried out entirely in English. Blue: our proposed Language-Mixed CoT, which alternates between English (anchor) and Korean (target). (Right) Performance comparison of KO-REAson-35B (ours, solid line) with DeepSeek-R1-32B, Exaone-Deep-32B, GPT-OSS-20B, and QwQ-32B. KO-REAson-35B achieves top-tier performance, ranking first or second on all tasks.
  • Figure 2: An overview of publicly available Korean datasets. Yi-Sang is larger than any fine-tuning dataset or pretraining corpus, with 6.77B tokens.
  • Figure 3: Category distribution across different stages of the dataset collection. (a) Sources (N=54): counts of the public Q&A and community websites we compiled; categories were manually assigned by the authors based on contextual review. (b) Questions: after crawling, items inherit the category from their source. (c) Responses: after response generation, we added OpenThoughts (Guha et al., 2025) as an additional source. Colors are shared across panels; centers show total counts.
  • Figure 4: Average scores across HAE-RAE Bench, MCLM, and KMMLU-Redux for Gemma-3-4B and Kanana-1.5-8B under three settings. (a) Augmentation. Option and Style are comparable on Gemma-3-4B (49.3 vs 49.5), while Option has a modest edge on Kanana-1.5-8B (58.8 vs 56.6); neither augmentation is uniformly superior. (b) Teacher (Long CoT). Qwen3-32B yields higher averages than Qwen3-4B (Gemma: 51.8 $>$ 48.6; Kanana: 63.8 $>$ 56.3). (c) Teacher (Short CoT). With short CoT, Qwen3-32B tops Gemini-2.5-Pro (Gemma: 41.8 $>$ 39.7; Kanana: 48.5 $>$ 45.9). Overall, Language-Mixed CoT and using Qwen3-32B as the teacher provide the strongest gains; both augmentation choices offer benefits.
  • Figure 5: Performance of Gemma-3-12B and its post-trained variant on English reasoning benchmarks and Korean multimodal benchmarks. KO-REAson-12B, trained only with text supervision, shows consistent gains across all tasks, indicating both cross-lingual and multimodal transfer.
  • ...and 4 more figures
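
The comparisons quoted in the Figure 4 caption are easier to compare as per-model margins. The short sketch below only recomputes differences from the averages stated in that caption; the dictionary layout and the helper name `margins` are assumptions made for illustration, and no numbers beyond the caption are used.

```python
# Averages quoted in the Figure 4 caption, stored as (first, second) per model.
FIGURE_4 = {
    "Option vs Style": {
        "Gemma-3-4B": (49.3, 49.5),
        "Kanana-1.5-8B": (58.8, 56.6),
    },
    "Long CoT: Qwen3-32B vs Qwen3-4B": {
        "Gemma-3-4B": (51.8, 48.6),
        "Kanana-1.5-8B": (63.8, 56.3),
    },
    "Short CoT: Qwen3-32B vs Gemini-2.5-Pro": {
        "Gemma-3-4B": (41.8, 39.7),
        "Kanana-1.5-8B": (48.5, 45.9),
    },
}


def margins(table: dict) -> dict:
    """Return first-minus-second differences, rounded to one decimal place."""
    return {
        setting: {model: round(a - b, 1) for model, (a, b) in scores.items()}
        for setting, scores in table.items()
    }


if __name__ == "__main__":
    for setting, per_model in margins(FIGURE_4).items():
        print(setting, per_model)
```

Read this way, the larger teacher's advantage under long CoT is wider on Kanana-1.5-8B (7.5 points) than on Gemma-3-4B (3.2 points), while the augmentation comparison changes sign between the two models, which matches the caption's remark that neither augmentation is uniformly superior.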