SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?

Shiqi Chen; Jingze Gai; Ruochen Zhou; Jinghan Zhang; Tongyao Zhu; Junlong Li; Kangrui Wang; Zihan Wang; Zhengyu Chen; Klara Kaleb; Ning Miao; Siyang Gao; Cong Lu; Manling Li; Junxian He; Yee Whye Teh

TL;DR

This paper introduces SkillCraft, a benchmark that explicitly stress-tests agents' ability to form and reuse higher-level tool compositions, and proposes a lightweight evaluation protocol that enables agents to auto-compose atomic tools into executable Skills and to cache and reuse them within and across tasks, thereby improving efficiency while accumulating a persistent library of reusable skills.

Abstract

Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool compositions. However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills. We address this gap by introducing SkillCraft, a benchmark that explicitly stress-tests agents' ability to form and reuse higher-level tool compositions, which we call Skills. SkillCraft features realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions, designed to elicit skill abstraction and cross-task reuse. We further propose a lightweight evaluation protocol that enables agents to auto-compose atomic tools into executable Skills and to cache and reuse them within and across tasks, thereby improving efficiency while accumulating a persistent library of reusable skills. Evaluating state-of-the-art agents on SkillCraft, we observe substantial efficiency gains, with token usage reduced by up to 80% through skill saving and reuse. Moreover, success rate strongly correlates with tool composition ability at test time, underscoring compositional skill acquisition as a core capability.
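
To make the compose, verify, cache, and reuse cycle described in the abstract concrete, the following is a minimal sketch and not the paper's implementation. The class `SkillLibrary`, its methods, and the two toy tools `fetch_price` and `apply_tax` are hypothetical names invented for illustration; the actual protocol is built on MCP primitives and a coding verifier whose interfaces are not given in this text.

```python
from typing import Any, Callable, Dict, List

# Hypothetical atomic tools; stand-ins for real tool calls.
def fetch_price(item: str) -> float:
    prices = {"apple": 1.5, "pear": 2.0}
    return prices[item]

def apply_tax(amount: float) -> float:
    return round(amount * 1.1, 2)

class SkillLibrary:
    """Toy store of verified skills (illustrative, not the paper's API)."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable[[Any], Any]] = {}

    def compose(self, name: str, steps: List[Callable[[Any], Any]],
                check_input: Any, expected: Any) -> bool:
        """Chain atomic tools into one callable and keep it only if it verifies."""
        def skill(value: Any) -> Any:
            for step in steps:
                value = step(value)
            return value

        try:
            verified = skill(check_input) == expected
        except Exception:
            # A crashing candidate is rejected rather than cached.
            verified = False
        if verified:
            self._skills[name] = skill
        return verified

    def has(self, name: str) -> bool:
        return name in self._skills

    def run(self, name: str, value: Any) -> Any:
        """Reuse a cached skill instead of re-exploring the tool chain."""
        return self._skills[name](value)

library = SkillLibrary()
saved = library.compose("price_with_tax", [fetch_price, apply_tax], "apple", 1.65)
result = library.run("price_with_tax", "pear")
```

In this sketch a later task calls `library.run("price_with_tax", ...)` once rather than invoking both tools again, which is the kind of reuse the abstract credits for reduced token usage.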

Paper Structure (50 sections, 8 figures, 7 tables)

Introduction
SkillCraft
What kinds of tasks can evaluate skill composition?
How to curate such tasks?
How to Evaluate Tool Composition Ability?
SkillCraft Protocol
Four Minimal MCP Primitives
Coding Verifier
SkillCraft Protocol Pipeline
Evaluation
Models
Metrics
Results
What is a good tool composition?
Is Deeper Composition Always Better?
...and 35 more sections

Figures (8)

Figure 2: SkillCraft Protocol Pipeline Overview. The pipeline consists of three stages: (1) Test-Time Tool-Chain Evolution: The agent solves tasks from the Task Library by exploring and chaining atomic tools, forming executable tool sequences. (2) Iterative Skill Composition: Successful sequences are abstracted into candidate skills, executed and verified in a coding environment; failed executions trigger re-exploration, while validated skills are stored. (3) Skill Library and Reuse: A growing repository of verified, reusable skills that can be retrieved in later tasks to replace low-level tool exploration, enabling test-time skill accumulation and efficient composition.
Figure 3: Three-stage task construction pipeline for SkillCraft. In Stage 1, we explore existing benchmarks through systematic experimentation to identify effective task design principles. In Stage 2, we construct seed tasks from three sources: (i) selected tasks from Stage 1 with unified interfaces, (ii) newly handcrafted web API-based tasks, and (iii) local file and data processing tasks. In Stage 3, we systematically scale the seed tasks via quantitative scaling (increasing subtask count) and complexity scaling (increasing tool calls per subtask), producing a task repository with graduated difficulty levels.
Figure 4: Task distribution in SkillCraft. The chart shows 21 task families across 6 application domains. The table summarizes difficulty levels: Entity Num = number of target items (subtasks) per task; Complexity = tool calls required per entity.
Figure 5: Cross-metric correlation heatmap. Metrics are grouped into four categories: Success, Skill, Eff_Base, and Eff_Save. Key findings: (1) Skill execution rate correlates with task success (r=0.65); (2) Stronger models achieve greater efficiency gains from skills (r=0.53).
Figure 6: (a) Hierarchical skill composition in Iteration mode. A task organized as a depth-3 skill hierarchy, where atomic tools are encapsulated by low-level skills, composed into medium-level skills with additional processing, and orchestrated by a high-level skill. Efficiency gains compound across levels. (b) Error propagation in hierarchical skills. A null value returned by a low-level skill triggers a TypeError in the medium-level skill, which cascades into complete failure of the high-level skill. The tree structure amplifies the impact of edge-case bugs.
...and 3 more figures
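
The error propagation described in the Figure 6(b) caption can be reproduced in a few lines. This is a toy sketch under stated assumptions: the three skill functions and their data are hypothetical, and the guarded variants at the end are one possible mitigation added for illustration, not a fix taken from the paper.

```python
from typing import List, Optional

def get_rating(item: str) -> Optional[float]:
    """Low-level skill: returns None for an unknown item (the edge case)."""
    ratings = {"a": 4.0, "b": 3.0}
    return ratings.get(item)

def scaled_rating(item: str) -> float:
    """Medium-level skill: assumes a number, so a None input raises TypeError."""
    return get_rating(item) * 20.0

def report(items: List[str]) -> float:
    """High-level skill: a single failing child aborts the whole call."""
    return sum(scaled_rating(i) for i in items) / len(items)

# One unknown item is enough to fail the entire hierarchy.
try:
    report(["a", "b", "zzz"])
    cascade_failed = False
except TypeError:
    cascade_failed = True

def scaled_rating_safe(item: str) -> Optional[float]:
    """Guarded medium-level skill (illustrative): pass None through."""
    value = get_rating(item)
    return None if value is None else value * 20.0

def report_safe(items: List[str]) -> Optional[float]:
    """Guarded high-level skill (illustrative): skip items with no rating."""
    scores = [s for s in map(scaled_rating_safe, items) if s is not None]
    return sum(scores) / len(scores) if scores else None

safe_result = report_safe(["a", "b", "zzz"])
```

The unguarded `report` fails on the whole list even though two of three items are valid, which mirrors how a tree of skills amplifies one edge-case bug.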
