Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning

Rafael Pardinas, Ehsan Kamalloo, David Vazquez, Alexandre Drouin

Abstract

Building general-purpose reasoning models using reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty, and sample efficiency. Further, models with long chain-of-thought traces increase inference cost and latency, making efficiency critical for practical deployment. We present Apriel-Reasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles, and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained with a strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches strong open-weight models of similar size at lower token cost, thereby pushing the Pareto frontier of accuracy versus token budget.
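The abstract names a difficulty-aware length penalty but does not give its formula, so the following is a minimal sketch of one plausible instantiation. It assumes difficulty is proxied by the group failure rate (one minus the pass rate across a prompt's rollouts), a quantity already computed for verifiable rewards and therefore free of extra training overhead; the function name, the `alpha` scale, and the within-group length normalization are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def difficulty_aware_length_penalty(lengths, correct, alpha=0.2):
    """Sketch of a difficulty-aware length penalty (assumed form).

    lengths : np.ndarray of token counts for the G rollouts of one prompt.
    correct : boolean np.ndarray of verifier outcomes for the same rollouts.
    alpha   : hypothetical maximum penalty magnitude.
    """
    # Difficulty proxy: group failure rate, already available from the
    # verifiable-reward rollouts, so no additional training overhead.
    difficulty = 1.0 - correct.mean()
    lo, hi = lengths.min(), lengths.max()
    if hi == lo:  # all rollouts equally long: nothing to discriminate
        return np.zeros(len(lengths))
    rel_len = (lengths - lo) / (hi - lo)  # 0 = shortest in group, 1 = longest
    # Easy prompts (low difficulty) get the full penalty on long traces;
    # hard prompts get almost none, leaving room for longer reasoning.
    penalty = -alpha * (1.0 - difficulty) * rel_len
    # Apply only to correct rollouts so wrong answers are not pushed shorter.
    return np.where(correct, penalty, 0.0)
```

Under this shaping, a correct trace that is the longest in its group loses up to `alpha` reward on an easy prompt but almost nothing on a hard one, matching the abstract's stated goal of longer reasoning for difficult problems and shorter traces for easy ones.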

Paper Structure

This paper contains 41 sections, 9 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Higher accuracy / shorter traces for Apriel-Reasoner under a 32K output budget.
  • Figure 2: PipelineRL with the multi-domain extensions used in this work. A domain-weighted sampler draws prompts from the five training environments, environment-specific verifiers score the resulting rollouts, and the data flow through the actor, preprocessor, and trainer stages for asynchronous RL post-training (a hedged sketch of such a sampler follows this list).
  • Figure 3: Average number of reasoning steps per trace and average number of tokens per step for Apriel-Base and Apriel-Reasoner on AIME 2025. The number of steps is comparable, but Apriel-Reasoner expresses each step more concisely.
  • Figure 4: Mean reward progress during training for each domain.
  • Figure 5: Step-type distribution for Apriel-Base and Apriel-Reasoner on AIME 2025. Apriel-Reasoner produces fewer verification and non-productive steps.
  • ...and 1 more figure
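Figure 2's caption describes a domain-weighted sampler feeding five environments whose rollouts complete at very different rates. The paper's exact mechanism is not spelled out here, so the sketch below shows one standard way to preserve target domain ratios under heterogeneous rollout dynamics: track each domain's realized share of completed rollouts and always draw the next prompt from the domain furthest below its target. The class name, the deficit rule, and the example weights are assumptions, not the paper's implementation.

```python
import random
from collections import defaultdict

class AdaptiveDomainSampler:
    """Sketch of deficit-based domain sampling (assumed mechanism).

    In asynchronous RL, domains with short rollouts finish faster and
    would dominate training batches under naive uniform sampling. This
    sampler compensates by picking the domain with the largest deficit
    between its target share and its realized share of completed rollouts.
    """

    def __init__(self, targets):
        total = sum(targets.values())
        self.targets = {d: w / total for d, w in targets.items()}
        self.completed = defaultdict(int)  # completed rollouts per domain

    def next_domain(self):
        done = sum(self.completed.values())
        if done == 0:  # no history yet: sample by target weight
            return random.choices(
                list(self.targets), weights=list(self.targets.values())
            )[0]
        # Deficit = target share minus realized share; draw the largest.
        return max(
            self.targets,
            key=lambda d: self.targets[d] - self.completed[d] / done,
        )

    def record_completed(self, domain):
        self.completed[domain] += 1

# Hypothetical usage with illustrative domain weights:
sampler = AdaptiveDomainSampler(
    {"math": 0.4, "code": 0.3, "instructions": 0.1, "puzzles": 0.1, "functions": 0.1}
)
domain = sampler.next_domain()
# ... generate and verify a rollout for `domain` ...
sampler.record_completed(domain)
```

Because the deficit is recomputed after every completed rollout, fast domains stop being drawn once they reach their target share, while slower, longer-rollout domains catch up, keeping the realized mixture close to the target ratios.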