Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale

Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, Shuyue Hu

TL;DR

Experiments show that tree-based retrieval effectively approximates oracle skill selection, and that DAG-based orchestration substantially outperforms native flat invocation even when given the identical skill set.

Abstract

The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem. In this paper, we propose AgentSkillOS, the first principled framework for skill selection, orchestration, and ecosystem-level management. AgentSkillOS comprises two stages: (i) Manage Skills, which organizes skills into a capability tree via node-level recursive categorization for efficient discovery; and (ii) Solve Tasks, which retrieves, orchestrates, and executes multiple skills through DAG-based pipelines. To evaluate the agent's ability to invoke skills, we construct a benchmark of 30 artifact-rich tasks across five categories: data computation, document creation, motion video, visual design, and web interaction. We assess the quality of task outputs using LLM-based pairwise evaluation, and the results are aggregated via a Bradley-Terry model to produce unified quality scores. Experiments across three skill ecosystem scales (200 to 200K skills) show that tree-based retrieval effectively approximates oracle skill selection, and that DAG-based orchestration substantially outperforms native flat invocation even when given the identical skill set. Our findings confirm that structured composition is the key to unlocking skill potential. Our GitHub repository is available at: https://github.com/ynulihao/AgentSkillOS.
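
To make the idea of DAG-based orchestration concrete, here is a minimal sketch of running skills in dependency order. It is an illustration only, not the AgentSkillOS implementation: the function `run_skill_dag`, the skill names, and the convention that each skill is a callable taking the task input plus its upstream outputs are all assumptions made for this example.

```python
from graphlib import TopologicalSorter


def run_skill_dag(dependencies, skills, task_input):
    """Run skills in an order that respects the dependency DAG.

    dependencies: maps each skill name to the set of skills it depends on.
    skills: maps each skill name to a callable (task_input, upstream_outputs).
    Both structures are hypothetical stand-ins for a real skill registry.
    """
    outputs = {}
    # static_order() yields every node after all of its predecessors
    # and raises graphlib.CycleError if the graph is not a DAG.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: outputs[dep] for dep in dependencies.get(name, ())}
        outputs[name] = skills[name](task_input, upstream)
    return outputs


# Toy three-step pipeline: load -> sort -> report (hypothetical skills).
skills = {
    "load_data": lambda task, up: [3, 1, 2],
    "sort_data": lambda task, up: sorted(up["load_data"]),
    "write_report": lambda task, up: f"{task}: {up['sort_data']}",
}
dependencies = {
    "load_data": set(),
    "sort_data": {"load_data"},
    "write_report": {"sort_data"},
}
results = run_skill_dag(dependencies, skills, "summary")
```

In contrast to flat invocation, where the agent calls skills one at a time with no explicit plan, each step here receives only the outputs of the skills it declares as dependencies, so intermediate artifacts flow along the edges of the graph.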

Paper Structure

This paper contains 27 sections, 2 equations, 7 figures, and 1 table.

Figures (7)

  • Figure 1: AgentSkillOS is a principled framework for efficient skill retrieval, orchestration, and ecosystem-level management to solve user-specified tasks.
  • Figure 2: Overview of our benchmark framework. The benchmark consists of a human-crafted dataset of 30 tasks spanning five categories: Data Computation, Document Creation, Motion Video, Visual Design, and Web Interaction. For evaluation, artifacts produced by agents in diverse raw formats (csv, video, json, web, pdf, gif) are first converted into LLM-evaluable formats (text and image) via scripted pipelines. An LLM judge then performs pairwise comparisons between systems with position swapping to mitigate order bias, and the results are aggregated into a global win matrix. Finally, a Bradley-Terry model is fitted via maximum likelihood to obtain strength parameters $\beta_i$, which are rescaled to produce continuous ranking scores $S_i$ for fine-grained differentiation of agent performance.
  • Figure 3: Overview of benchmark tasks. (a) All 30 tasks organized into five categories: Data Computation, Document Creation, Motion Video, Visual Design, and Web Interaction, with six tasks per category. Each task is listed with its abbreviated name. (b) Task complexity distributions across categories, measured by three dimensions: the number of skills required to complete the task, the number of output files the task expects, and the number of distinct output formats involved.
  • Figure 4: Per-category and overall Bradley-Terry scores (rescaled to $[0,100]$) derived from the pairwise comparisons in Table 1, shown for three skill ecosystem sizes ($|\mathcal{S}|{=}200$, $1\text{K}$, $200\text{K}$). Larger polygon area indicates stronger performance across categories. Three AgentSkillOS variants achieve the broadest coverage in all settings, consistent with the top Bradley-Terry scores reported in Table 1.
  • Figure 5: Ablation study of AgentSkillOS components. Each panel shows pairwise W / T / L counts (green = win, gray = tie, orange = loss) of Quality-First against four ablation variants for $|\mathcal{S}|{=}200$, $1\text{K}$, and $200\text{K}$. Removing DAG orchestration (w/ Oracle Skills, w/ Retrieval) or both retrieval and orchestration (w/ Full Pool) consistently degrades performance, confirming that both components are essential. Quality-First closely approaches the oracle upper bound (Quality-First (Oracle)), validating the effectiveness of tree-based skill retrieval.
  • ...and 2 more figures
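
The aggregation step described in the Figure 2 caption (win matrix, Bradley-Terry fit by maximum likelihood, rescaled scores) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the fit uses the standard minorization-maximization update for the Bradley-Terry likelihood, where $P(i \text{ beats } j) = e^{\beta_i} / (e^{\beta_i} + e^{\beta_j})$; the min-max mapping to $[0,100]$ is an assumed rescaling, since the page does not specify the one used; and the win counts are made up. The sketch also assumes every system has at least one win and that ties have already been resolved or dropped.

```python
import math


def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry log-strengths from a win matrix.

    wins[i][j] is how often system i beat system j. Uses the classic
    MM update p_i <- W_i / sum_j n_ij / (p_i + p_j), which increases
    the likelihood at every step. Assumes each system has >= 1 win.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        norm = sum(new_p)  # strengths are only defined up to scale
        p = [x / norm for x in new_p]
    return [math.log(x) for x in p]


def rescale_to_100(betas):
    """Min-max rescale strengths to [0, 100] (an assumed rescaling)."""
    lo, hi = min(betas), max(betas)
    if hi == lo:
        return [50.0] * len(betas)
    return [100.0 * (b - lo) / (hi - lo) for b in betas]


# Hypothetical win counts for three systems; judging each pair in both
# positions would simply contribute two outcomes per task to this matrix.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
betas = fit_bradley_terry(wins)
scores = rescale_to_100(betas)
```

With min-max rescaling the strongest system always maps to 100 and the weakest to 0, so the scores are only meaningful relative to the set of systems compared.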