Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training

Yuchen Zhuang, Jingfeng Yang, Haoming Jiang, Xin Liu, Kewei Cheng, Sanket Lokegaonkar, Yifan Gao, Qing Ping, Tianyi Liu, Binxuan Huang, Zheng Li, Zhengyang Wang, Pei Chen, Ruijie Wang, Rongzhi Zhang, Nasser Zalmout, Priyanka Nigam, Bing Yin, Chao Zhang

TL;DR

This work addresses the scarcity of agent-oriented pre-training data for large language models, which forces LLM agents to rely on prompting or fine-tuning, each with its own trade-offs. It introduces Hephaestus-Forge, a large-scale, multi-source corpus (103B tokens covering 76,537 APIs) designed to strengthen API function calling, intrinsic reasoning, and adaptation to environmental feedback, and it uses a three-stage framework (two stages of continual pre-training followed by instruction fine-tuning) to produce Hephaestus. Scaling-law analyses identify an optimal composition of roughly $36\%$ agent data, which corresponds to an approximately balanced $1:1:1$ mix of agent data, text, and code; this ratio guides the pre-training data mixture. Empirically, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals API-based commercial LLMs on three agent benchmarks, while generalizing across tasks and preserving general capabilities.
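The mixing ratio above is simple arithmetic, and a minimal sketch may make it concrete. The function names, source labels, and the total token budget below are hypothetical and chosen only for illustration; they do not come from the paper. The sketch shows only how a $1:1:1$ ratio translates into per-source fractions and token counts.

```python
def mixing_fractions(ratios):
    """Convert relative mixing weights (e.g. 1:1:1) into fractions that sum to 1."""
    total = sum(ratios.values())
    if total <= 0:
        raise ValueError("mixing weights must sum to a positive number")
    return {name: weight / total for name, weight in ratios.items()}


def token_budgets(total_tokens, ratios):
    """Split a total token budget across data sources according to the weights."""
    fractions = mixing_fractions(ratios)
    return {name: round(total_tokens * frac) for name, frac in fractions.items()}


# Hypothetical labels for the three data sources named in the summary.
ratios = {"agent": 1, "text": 1, "code": 1}

fractions = mixing_fractions(ratios)
# Hypothetical budget of 300 tokens, chosen only so the split is easy to read.
budgets = token_budgets(300, ratios)
```

With equal weights, each source receives one third of the budget, so the agent share is about $33\%$, close to the roughly $36\%$ optimum reported above.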

Abstract

Due to the scarcity of agent-oriented pre-training data, LLM-based autonomous agents typically rely on complex prompting or extensive fine-tuning, approaches that often fail to introduce new capabilities while preserving strong generalizability. We introduce Hephaestus-Forge, the first large-scale pre-training corpus designed to enhance the fundamental capabilities of LLM agents in API function calling, intrinsic reasoning and planning, and adapting to environmental feedback. Hephaestus-Forge comprises 103B tokens of agent-specific data encompassing 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function-calling trajectories to strengthen intrinsic reasoning. To explore effective training protocols, we investigate scaling laws to identify the optimal data mixing ratios. By continual pre-training on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks, demonstrating the effectiveness of our pre-training corpus in enhancing fundamental agentic capabilities and the generalization of LLMs to new tasks or environments.

Paper Structure

This paper contains 52 sections, 3 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Training paradigms of LLM agents. Prompting alone fails to introduce new knowledge and capabilities, while heavy fine-tuning can hinder generalization and degrade performance in non-agent use cases, potentially suppressing the original base model capabilities.
  • Figure 2: Data composition of (a) the entire Hephaestus-Forge, (b) seed data collection (\ref{sec:data-phase1}), and (c) retrieved agent data from the open web (\ref{sec:data-phase2}). A t-SNE visualization (d) depicts seed data (colorful points, with each color representing different data sources), retrieved data (black), and general text (gray) within the semantic space, where retrieved data is closer to the selected seed data than to the general text. Detailed data sources are in \ref{app:data-pretrain}.
  • Figure 3: Scaling law of the relationship between agent data mixing ratio ($\%$) and benchmark loss.
  • Figure 4: Overview of the pre-training (Stages I & II) and instruction fine-tuning (Stage III) framework in Hephaestus.
  • Figure 5: Training and benchmark loss. (a) Training loss of Hephaestus during continual pre-training and instruction fine-tuning. (b) Benchmark loss at periodic training checkpoints and (c) a comparison across base models.
  • ...and 2 more figures