Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration

Akhiad Bercovich, Nir Ailon, Vladimir Anisimov, Tomer Asida, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Roi Koren, Itay Levy, Zach Moshe, Pavlo Molchanov, Najeeb Nabwani, Mostofa Patwari, Omri Puny, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv

TL;DR

This work addresses the high inference cost of reasoning-focused LLMs that generate long traces by extending Puzzle, a post-training neural architecture search framework, to optimize deployment for MoE-based models. By jointly pruning MoE experts, selectively converting full-context attention to window attention, applying FP8 KV-cache quantization, and leveraging distillation and reinforcement learning, the authors derive gpt-oss-puzzle-88B, a deployment-optimized derivative of gpt-oss-120B. On an 8×H100 node and on a single GPU, Puzzle-88B achieves substantial throughput gains (up to 1.63× on 64K/64K and 1.22× on 4K/4K at the node level; up to 2.82× on a single GPU) while preserving or slightly improving suite-average reasoning accuracy. The results underscore that end-to-end efficiency for reasoning models depends on request-level metrics that account for tokens generated and reasoning effort, demonstrating that post-training NAS can reduce inference costs without compromising quality and offering practical guidance for deploying large reasoning LLMs. The study contributes a concrete methodology combining MoE pruning, selective window attention, calibrated KV quantization, and RL-based fine-tuning to advance efficient, scalable reasoning in open-weight LLMs.

Abstract

Reasoning-focused LLMs improve answer quality by generating longer reasoning traces, but the additional tokens dramatically increase serving cost, motivating inference optimization. We extend and apply Puzzle, a post-training neural architecture search (NAS) framework, to gpt-oss-120B to produce gpt-oss-puzzle-88B, a deployment-optimized derivative. Our approach combines heterogeneous MoE expert pruning, selective replacement of full-context attention with window attention, FP8 KV-cache quantization with calibrated scales, and post-training reinforcement learning to recover accuracy, while maintaining low generation length. In terms of per-token speeds, on an 8×H100 node we achieve 1.63× and 1.22× throughput speedups in long-context and short-context settings, respectively. gpt-oss-puzzle-88B also delivers throughput speedups of 2.82× on a single NVIDIA H100 GPU. However, because token counts can change with reasoning effort and model variants, per-token throughput (tok/s) and latency (ms/token) do not necessarily lead to end-to-end speedups: a 2× throughput gain is erased if traces grow 2×. Conversely, throughput gains can be spent on more reasoning tokens to improve accuracy; we therefore advocate request-level efficiency metrics that normalize throughput by tokens generated and trace an accuracy-speed frontier across reasoning efforts. We show that gpt-oss-puzzle-88B improves over gpt-oss-120B along the entire frontier, delivering up to 1.29× higher request-level efficiency. Across various benchmarks, gpt-oss-puzzle-88B matches or slightly exceeds the parent on suite-average accuracy across reasoning efforts, with retention ranging from 100.8% (high) to 108.2% (low), showing that post-training architecture search can substantially reduce inference costs without sacrificing quality.
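
The request-level argument in the abstract can be made concrete with a small sketch. It assumes the metric is token throughput divided by the average number of tokens generated per request, normalized to a baseline model. The function names and all numeric inputs below are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def request_rate(tokens_per_second: float, avg_tokens_per_request: float) -> float:
    """Requests completed per second, given per-token throughput and trace length."""
    if tokens_per_second <= 0 or avg_tokens_per_request <= 0:
        raise ValueError("throughput and token count must be positive")
    return tokens_per_second / avg_tokens_per_request


def relative_request_rate(
    model_tps: float,
    model_avg_tokens: float,
    baseline_tps: float,
    baseline_avg_tokens: float,
) -> float:
    """Request rate of a model normalized to a baseline (1.0 means no change)."""
    return request_rate(model_tps, model_avg_tokens) / request_rate(
        baseline_tps, baseline_avg_tokens
    )


# Hypothetical baseline: 1000 tok/s, 10,000 generated tokens per request.
BASE_TPS, BASE_TOKENS = 1000.0, 10_000.0

# Case 1: throughput doubles but traces also double, so the gain is erased.
erased = relative_request_rate(2000.0, 20_000.0, BASE_TPS, BASE_TOKENS)

# Case 2: throughput doubles and trace length is unchanged, so the gain is kept.
kept = relative_request_rate(2000.0, 10_000.0, BASE_TPS, BASE_TOKENS)

print(f"2x throughput, 2x tokens: {erased:.2f}x request rate")
print(f"2x throughput, same tokens: {kept:.2f}x request rate")
```

Under these assumed inputs the first case yields a relative request rate of 1.0 and the second 2.0, which is why per-token speedups alone do not determine end-to-end efficiency.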

Paper Structure (35 sections, 9 figures, 12 tables)

Figures (9)

  • Figure 1: Accuracy-speed frontier that accounts for both per-token throughput and tokens generated: (a) an 8$\times$H100 node and (b) a single H100 GPU. The x-axis shows relative request rate (higher is faster), computed as max token throughput (best configuration per model) in a 64K/64K scenario divided by the average number of tokens generated per request across our benchmark suite, and normalized to gpt-oss-120B (KV BF16, high reasoning effort) in the corresponding hardware setting. The y-axis is the suite's average accuracy. Colors denote models (blue: gpt-oss-120B; green: gpt-oss-puzzle-88B; purple: HyperNova-60B), line style denotes KV precision (KV BF16 dashed; KV FP8 solid), and markers denote reasoning effort (High/Medium/Low). HyperNova-60B is a third-party compressed derivative of gpt-oss-120B.
  • Figure 2: Our model architecture as chosen by Puzzle. Note how the earlier MoE layers appear to be far more important than the later ones. The parent model gpt-oss-120B has 128 experts per layer and alternates between sliding window attention with 128 tokens and global attention.
  • Figure 3: Accuracy and throughput comparisons of gpt-oss-puzzle-88B with its parent, gpt-oss-120B (both with KV FP8).
  • Figure 4: Throughput scaling with batch size on an $8\times$ H100 node. Comparison of (a) long-context and (b) short-context scenarios. Both models (gpt-oss-120B and gpt-oss-puzzle-88B) use KV FP8. Shaded bands show $\pm 1\sigma$.
  • Figure 5: Latency vs throughput trade-off (64K/64K). Both models (gpt-oss-120B and gpt-oss-puzzle-88B) use KV FP8. Shaded bands show $\pm 1\sigma$ across repeated runs.
  • ...and 4 more figures