PostTrainBench: Can LLM Agents Automate LLM Post-Training?

Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, Maksym Andriushchenko

TL;DR

This paper explores post-training, the critical phase that turns base LLMs into useful assistants, and introduces PostTrainBench, a benchmark that measures how well LLM agents can perform post-training autonomously under bounded compute constraints.

Abstract

AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post-training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 23.2% for the best agent vs. 51.1% for official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. We also observe several failure modes worth flagging. Agents sometimes engage in reward hacking: training on the test set, downloading existing instruction-tuned checkpoints instead of training their own, and using API keys they find to generate synthetic data without authorization. These behaviors are concerning and highlight the importance of careful sandboxing as these systems become more capable. Overall, we hope PostTrainBench will be useful for tracking progress in AI R&D automation and for studying the risks that come with it. Website and code are available at https://posttrainbench.com/.

Paper Structure (51 sections, 1 equation, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Weighted average benchmark performance for different agents across 4 base models (Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, Gemma-3-4B) and 7 benchmarks: AIME 2025 and GSM8K (math), GPQA (science), HumanEval (coding), BFCL (function calling), Arena-Hard (creative writing), and HealthBench (health advice). The averaging weights are specified in Table \ref{tab:weights-for-avg}. The error bars show $\pm 1$ standard deviation across runs.
  • Figure 2: PostTrainBench pipeline. An agent receives a base LLM, target benchmark, and 10 hours on one H100 GPU, then post-trains the model to maximize performance. An LLM judge detects cheating (model substitution, data contamination); flagged runs receive the base model score. Each agent is evaluated on 28 model–benchmark configurations (4 base LLMs $\times$ 7 benchmarks); frontier agents on native scaffolds are run 3 times per configuration to estimate variance.
  • Figure 3: Condensed execution trace of Opus 4.5 (Claude Code) post-training Gemma-3-4B-Base for HumanEval. The agent implements contamination filtering, adapts to timeout failures, and debugs vLLM issues. It raises the model's score from an initial 0% to 37.3% over 104 turns, taking 9:20 hours and $4.62 in API cost.
  • Figure 4: Performance for various model sizes of Claude.
  • Figure 5: Effect of time budget on agent performance, averaged across all base models and benchmarks. Claude Opus 4.5 performance plateaus around 5 hours, while GPT-5.1 Codex Max continues improving up to 10 hours.
  • ...and 2 more figures
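
The scoring rules stated in the captions of Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the weights passed in the usage note are uniform placeholders (the real weights are in the paper's weights table, which is not reproduced here), and the use of the sample standard deviation is an assumption, since the caption only says "standard deviation across runs".

```python
from itertools import product
from statistics import mean, stdev

# Base models and benchmarks as listed in the Figure 1 caption.
BASE_MODELS = ["Qwen3-1.7B", "Qwen3-4B", "SmolLM3-3B", "Gemma-3-4B"]
BENCHMARKS = ["AIME 2025", "GSM8K", "GPQA", "HumanEval",
              "BFCL", "Arena-Hard", "HealthBench"]


def configurations():
    """All model-benchmark pairs: 4 base LLMs x 7 benchmarks = 28."""
    return list(product(BASE_MODELS, BENCHMARKS))


def effective_score(agent_score, base_score, flagged):
    """Runs flagged for cheating by the judge receive the base model score."""
    return base_score if flagged else agent_score


def weighted_average(scores, weights):
    """Weighted mean over benchmarks; `weights` is a hypothetical mapping."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


def mean_and_std(run_scores):
    """Mean and sample standard deviation across repeated runs (assumed)."""
    spread = stdev(run_scores) if len(run_scores) > 1 else 0.0
    return mean(run_scores), spread
```

For example, `effective_score(0.9, 0.1, flagged=True)` falls back to the base score `0.1`, and `weighted_average(scores, {name: 1.0 for name in scores})` reduces to a plain mean under the placeholder uniform weights.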