
EvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification

Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, Xue Liu, Xiaoxiao Li, Philip S. Yu

Abstract

Anthropic proposes the concept of skills for LLM agents to tackle multi-step professional tasks that simple tool invocations cannot address. A tool is a single, self-contained function, whereas a skill is a structured bundle of interdependent multi-file artifacts. Currently, skill generation is not only labor-intensive due to manual authoring, but may also suffer from human–machine cognitive misalignment, which can degrade agent performance, as evidenced by evaluations on SkillsBench. We therefore aim to enable agents to autonomously generate skills. However, existing self-evolving methods designed for tools cannot be directly applied to skills due to their increased complexity. To address these issues, we propose EvoSkills, a self-evolving skills framework that enables agents to autonomously construct complex, multi-file skill packages. Specifically, EvoSkills couples a Skill Generator that iteratively refines skills with a Surrogate Verifier that co-evolves to provide informative and actionable feedback without access to ground-truth test content. On SkillsBench, EvoSkills achieves the highest pass rate against five baselines on both Claude Code and Codex, and also generalizes strongly to six additional LLMs.
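The co-evolutionary loop described above can be sketched as follows. This is a hypothetical illustration only: all class and function names (`SkillGenerator`, `SurrogateVerifier`, `oracle_passes`, `evolve_skill`) are invented for exposition, and a "skill" is reduced to an integer quality score; the paper's actual skill packages are multi-file artifacts. The sketch shows the two coupled dynamics from the abstract: the generator refines the skill using the verifier's structured feedback, while an oracle failure (an opaque pass/fail signal with no test content) triggers test escalation in the verifier.

```python
# Hypothetical sketch of the EvoSkills generator/verifier co-evolution loop.
# All names are illustrative, not the paper's API; a "skill" is an integer
# quality score standing in for a multi-file skill package.

class SkillGenerator:
    def initial(self):
        return 0  # stand-in for an initial draft skill

    def refine(self, skill, feedback):
        # Improve the skill using the verifier's structured feedback.
        return skill + feedback["suggested_improvement"]


class SurrogateVerifier:
    """Checks skills without access to ground-truth test content."""

    def __init__(self):
        self.threshold = 2  # strictness of the surrogate tests

    def check(self, skill):
        ok = skill >= self.threshold
        # Unlike the oracle, the verifier returns actionable feedback.
        return ok, {"suggested_improvement": 1}

    def escalate(self):
        self.threshold += 1  # stricter surrogate tests after an oracle miss


def oracle_passes(skill):
    # Ground-truth oracle: opaque pass/fail only, strictly isolated
    # from the generator and verifier.
    return skill >= 3


def evolve_skill(generator, verifier, rounds=10):
    skill = generator.initial()
    for _ in range(rounds):
        ok, feedback = verifier.check(skill)
        if not ok:
            skill = generator.refine(skill, feedback)  # skill evolves
        elif oracle_passes(skill):
            return skill                               # verified and accepted
        else:
            verifier.escalate()                        # verifier co-evolves
    return skill
```

In this toy run the verifier first drives the skill up to its own threshold, the oracle rejects it once (forcing escalation), and the stricter verifier then pushes the skill past the oracle's bar, mirroring how surrogate feedback and opaque oracle signals interact in the framework.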

Paper Structure

This paper contains 40 sections, 7 equations, 8 figures, 7 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Tool–skill difference illustration.
  • Figure 2: Skill quality improvement across 5 evolution rounds. EvoSkills surpasses human-curated skills within 5 evolution iterations.
  • Figure 3: Overview of the EvoSkills co-evolutionary framework. The Skill Generator and Surrogate Verifier co-evolve through iterative refinement. The verifier provides structured failure feedback to drive skill improvement, while a ground-truth oracle test returns only an opaque pass/fail signal, triggering test escalation and ensuring strict information isolation.
  • Figure 4: Skill quality comparisons with baselines on SkillsBench (Claude Opus 4.6 + Claude Code). Error bars: ±1 std over 5 runs.
  • Figure 5: Cross-model skill transferability on SkillsBench. Skills evolved by Claude Opus 4.6 are transferred to six additional models spanning five providers. Each pair of bars shows the no-skill baseline (red) and the with-skills pass rate (blue). Delta annotations indicate absolute improvement. All models benefit substantially (+36–44 pp), confirming that the evolved skills encode reusable task structure rather than model-specific artifacts.
  • ...and 3 more figures