APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training

Jiarui Qin, Yunjia Xi, Junjie Huang, Renting Rui, Di Yin, Weiwen Liu, Yong Yu, Weinan Zhang, Xing Sun

TL;DR

APTBench is a lightweight, trajectory-based benchmark for evaluating the agentic potential of base LLMs during pre-training. It converts real-world agent tasks into multiple-choice question (MCQ) and text-completion formats across the software engineering (SWE) and deep research (DR) domains, targeting core abilities such as planning, action, and atomic tasks, and it uses long, multi-turn trajectories to supply rich context. Experiments across model sizes and data regimes show that agentic capabilities emerge at larger scales and that APTBench scores correlate strongly with downstream agent performance, suggesting that data quality and task alignment are critical for agentic pre-training. The framework offers a scalable, cost-effective alternative to post-training agent evaluations and a basis for steering the pre-training data mix and model design toward better agentic behavior.

Abstract

With the rapid development of LLM-based agents, there is a growing trend to incorporate agent-specific data into the pre-training stage of LLMs, aiming to better align LLMs with real-world autonomous task execution. However, current pre-training benchmarks primarily focus on isolated and static skills, e.g., common knowledge or mathematical/code reasoning, and fail to reflect a model's agentic capabilities. On the other hand, agent benchmarks are typically designed for post-trained models and require multi-turn task execution abilities that base models struggle to support. Thus, there is a compelling need for a benchmark that can evaluate agentic potential during pre-training and guide model training more effectively. To address this gap, we propose APTBench, a framework that converts real-world agent tasks and successful trajectories into multiple-choice or text completion questions tailored for base models. It focuses on core agentic abilities, e.g., planning and action, and covers two key agent scenarios: software engineering and deep research. Compared to existing general-purpose benchmarks, APTBench offers a more predictive signal of a model's downstream performance as an agent, while remaining significantly more lightweight and cost-effective than full-scale, end-to-end agent evaluations after post-training.
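
The conversion described in the abstract, from a successful trajectory to a multiple-choice question a base model can answer by completion, might look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' pipeline: the names `MCQItem` and `build_mcq`, the prompt layout, and the idea that distractors are passed in by the caller are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class MCQItem:
    prompt: str  # task, trajectory prefix, and lettered options
    answer: str  # letter of the correct option


def build_mcq(task, steps, target_idx, distractors, seed=0):
    """Turn one step of a successful trajectory into a multiple-choice item.

    Hypothetical helper: steps[:target_idx] becomes the context,
    steps[target_idx] is the correct option, and `distractors` are
    incorrect alternatives supplied by the caller.
    """
    if not 0 <= target_idx < len(steps):
        raise ValueError("target_idx must index into steps")
    correct = steps[target_idx]
    options = [correct] + list(distractors)
    letters = "ABCDEFGH"
    if len(options) > len(letters):
        raise ValueError("too many options for the available letters")
    # Seeded shuffle so the same inputs always yield the same item.
    random.Random(seed).shuffle(options)
    lines = [f"Task: {task}"]
    lines += [f"Step {i + 1}: {s}" for i, s in enumerate(steps[:target_idx])]
    lines.append("Which action should come next?")
    lines += [f"{letter}. {opt}" for letter, opt in zip(letters, options)]
    lines.append("Answer:")
    return MCQItem("\n".join(lines), letters[options.index(correct)])


if __name__ == "__main__":
    # Made-up toy trajectory, for illustration only.
    item = build_mcq(
        task="Fix the failing unit test",
        steps=["open the test file", "run the test suite", "edit the buggy function"],
        target_idx=2,
        distractors=["delete the repository", "do nothing", "reboot the machine"],
    )
    print(item.prompt)
    print("Correct:", item.answer)
```

Ending the prompt with "Answer:" lets a base model be scored on the next token it predicts, with no multi-turn interaction required.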

Paper Structure

This paper contains 71 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The correlation between models' performance on general benchmarks (MMLU, EvalPlus, and GSM8K) and on an agent benchmark (SWE-bench Verified) is low. In subfigure (a), six models with similar MMLU scores (86-88) show a 30-point difference on SWE-bench Verified. Similar patterns can be observed in the other two subfigures. We also report the Pearson correlation coefficient (r) and p-value; the low r-values and high p-values likewise indicate a weak correlation.
  • Figure 2: The construction process of APTBench. First, we collect agentic tasks and successful trajectories from real-world domains. Then, we generate multiple-choice questions and text completion tasks through correct-answer extraction and negative-choice generation.
  • Figure 3: The prompt length distribution of APTBench.
  • Figure 4: The correlation between models' performance on an agent benchmark (SWE-bench Verified) and on our APTBench (SWE, SWE w/o long-context tasks, DR, and DR w/o long-context tasks). The high Pearson correlation coefficients (r) and low p-values indicate a strong correlation.
  • Figure 5: The correlation between models' performance on general benchmarks (MMLU, EvalPlus, and GSM8K) and on an agent benchmark (Terminal-Bench).
  • ...and 3 more figures
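
Several of the captions above report a Pearson correlation coefficient (r) between scores on two benchmarks. As a minimal sketch of how such a coefficient is computed, the function below implements the standard formula using only the Python standard library. The function name `pearson_r` and the score lists are hypothetical; the numbers are invented toy values, not results from the paper. The p-value additionally requires a t-distribution; `scipy.stats.pearsonr` returns both values if SciPy is available.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented toy scores for five imaginary models (NOT from the paper).
    general_benchmark = [86.0, 87.5, 86.5, 88.0, 87.0]
    agent_benchmark = [35.0, 20.0, 50.0, 30.0, 42.0]
    print(f"r = {pearson_r(general_benchmark, agent_benchmark):.3f}")
```

A value of r near 0 means that one benchmark says little about the other, whereas a value near 1 means that models rank similarly on both.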