EXAONE Deep: Reasoning Enhanced Language Models

Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun

TL;DR

EXAONE Deep introduces three reasoning-focused LLMs at $2.4B$, $7.8B$, and $32B$ parameters, trained with supervised fine-tuning (SFT), direct preference optimization, and online reinforcement learning to enhance chain-of-thought (CoT) reasoning. The data strategy emphasizes long CoT sequences, including a large SFT corpus (~$12\text{B}$ tokens) and targeted preference and reinforcement learning datasets. Training was performed on NVIDIA H100 hardware, and the FLOP budgets are reported in detail. Across benchmarks such as MATH-500, AIME, CSAT, GPQA Diamond, LiveCodeBench, and MMLU/MMLU-Pro, the $32B$ model remains competitive with leading open-weight models, the $7.8B$ model often surpasses similarly sized baselines, and the $2.4B$ variant outperforms distilled counterparts, indicating strong reasoning capabilities at multiple scales. The models are released under a license for research purposes, and the authors suggest extending reasoning capabilities to tasks with less well-defined answers as future work.

Abstract

We present the EXAONE Deep series, which exhibits superior capabilities in various reasoning tasks, including math and coding benchmarks. We train our models mainly on a reasoning-specialized dataset that incorporates long streams of thought processes. Evaluation results show that our smaller models, EXAONE Deep 2.4B and 7.8B, outperform other models of comparable size, while the largest model, EXAONE Deep 32B, demonstrates competitive performance against leading open-weight models. All EXAONE Deep models are openly available for research purposes and can be downloaded from https://huggingface.co/LGAI-EXAONE

Paper Structure

This paper contains 17 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overall performance comparison. The EXAONE Deep 32B model demonstrates competitive performance compared to leading open-weight reasoning models such as QwQ-32B and DeepSeek-R1. It also outperforms both DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B. The lightly colored regions in AIME 2024 and 2025 show the performance of majority vote (consensus).
  • Figure 2: Distribution of token counts in the SFT dataset. Data points in the Code domain are notably longer on average, whereas those in the Others domain tend to be shorter.
  • Figure 3: An example from the SFT dataset. The dataset is designed to help models carry out reasoning tasks through an extended chain of thought.
  • Figure 4: Prompt for evaluating models on short-answer questions. We apply the prompt to the MATH-500, AIME 2024/2025, and CSAT 2025 benchmarks.
  • Figure 5: Prompt used for evaluating EXAONE Deep models on multiple-choice questions. We apply the prompt to the CSAT 2025, GPQA Diamond, MMLU, and MMLU-Pro benchmarks. The number of options is adjusted for each test case.
  • ...and 1 more figure
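
The Figure 1 caption mentions majority-vote (consensus) scores on AIME 2024 and 2025. As a rough illustration of that idea, the sketch below picks the most frequent final answer among several sampled generations per problem and scores it against a reference. The function names, the string-valued answers, and the tie-breaking rule are assumptions made for this example; the number of samples and the answer normalization actually used in the paper are not stated here.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among sampled final answers.

    Hypothetical helper: ties are broken in favor of the answer
    that appeared first in the input list.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common sorts by count; equal counts keep first-seen order.
    return counts.most_common(1)[0][0]


def consensus_accuracy(samples_per_problem, references):
    """Fraction of problems whose majority-vote answer matches the reference."""
    if len(samples_per_problem) != len(references):
        raise ValueError("one reference is required per problem")
    correct = sum(
        majority_vote(samples) == ref
        for samples, ref in zip(samples_per_problem, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    # Toy data, not results from the paper.
    samples = [["42", "17", "42"], ["3", "5", "5"]]
    refs = ["42", "3"]
    print(majority_vote(samples[0]))
    print(consensus_accuracy(samples, refs))
```

In this toy setup, a problem counts as solved under consensus only if the most common sampled answer equals the reference, which is why consensus scores can differ from single-sample accuracy.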