How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, Shiyu Chang

Abstract

Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet rigorous benchmarking of skill usage remains scarce. Existing skill benchmarking efforts focus on overly idealized conditions in which LLMs are directly provided with hand-crafted skills narrowly tailored to each task, whereas in many realistic settings the LLM agent must search for and select relevant skills on its own, and even the closest-matching skills may not be well suited to the task. In this paper, we conduct the first comprehensive study of skill utility under progressively more challenging, realistic settings, where agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, both query-specific and query-agnostic, and show that query-specific refinement substantially recovers the lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at https://github.com/UCSB-NLP-Chang/Skill-Usage.
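
The retrieval setting described above can be pictured with a minimal sketch. The abstract does not specify how skills are retrieved from the 34k-skill collection, so the embedding model, the `Skill` structure, and the `retrieve_skills` helper below are illustrative assumptions only: the point is simply to rank a large skill collection against the task query and surface the top matches for the agent to load.

```python
# Minimal sketch of embedding-based skill retrieval (illustrative only;
# the paper's actual retrieval mechanism is not specified in this excerpt).
from dataclasses import dataclass

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend


@dataclass
class Skill:
    name: str
    description: str  # short summary that the retriever indexes
    body: str         # full instructions / code snippets loaded into the agent context


def retrieve_skills(query: str, skills: list[Skill], k: int = 3) -> list[Skill]:
    """Rank skills by cosine similarity between the task query and skill descriptions."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    skill_vecs = model.encode([s.description for s in skills], normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = skill_vecs @ query_vec          # cosine similarity (embeddings are normalized)
    top = np.argsort(-scores)[:k]            # indices of the k best-matching skills
    return [skills[i] for i in top]
```

In this sketch only the top-k skill bodies would be placed in the agent's context; as the results summarized above indicate, whether the agent then actually loads and follows the right skill is a separate failure point.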


Figures (3)

  • Figure 1: Left: A SkillsBench example where the task asks agents to identify flooding days at USGS stations. The three curated skills collectively provide the specific API to call, the data source URL for flood thresholds, and code snippets for flood detection (task-specific details are highlighted in blue), effectively forming a step-by-step solution guide. These skills are directly placed in the agent's context without requiring retrieval. Right: Agent pass rates on SkillsBench degrade as evaluation settings become more realistic, from curated skills to settings where agents must retrieve skills from a large collection.
  • Figure 2: (a) Pass rates on SkillsBench under progressively realistic settings, including a force-loaded upper bound. Performance degrades consistently as settings become more realistic. (b) Skill usage across settings. Solid bars show the fraction of trajectories that load any skill; hatched bars show the fraction that load all curated skills. Agents often fail to load curated skills even when they are directly available, and the gap widens as distractors are added and retrieval is required.
  • Figure 3: Example of query-specific refinement on a Terminal-Bench 2.0 tensor parallelism task. Top: Without refinement, the agent retrieves two partially relevant skills but only loads torch-tensor-parallel, ignoring pytorch-research. The loaded skill covers weight sharding but lacks differentiable collective wrappers, leading to an incorrect implementation for world_size > 1. Bottom: After refinement, the agent synthesizes a new skill that merges tensor parallelism knowledge from the first skill with custom autograd.Function patterns from the second, producing an implementation that passes all tests.
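
The query-specific refinement illustrated in Figure 3 can be sketched as a single extra LLM call before the agent starts the task. The prompt wording and the `refine_skill` helper below are hypothetical, not the paper's implementation (the `Skill` dataclass is reused from the retrieval sketch above): the idea is to merge the retrieved candidates into one skill tailored to the current task, mirroring how the refined skill in the example combines tensor-parallel sharding with custom autograd.Function patterns.

```python
# Hypothetical sketch of query-specific skill refinement: merge retrieved skills
# into a single task-tailored skill before handing it to the agent.
from typing import Callable

REFINE_PROMPT = """You are preparing a skill document for a coding agent.

Task:
{task}

Candidate skills (may be partially relevant or overlapping):
{skills}

Write ONE refined skill that keeps only the parts useful for this task,
merges complementary knowledge across the candidates, and fills obvious gaps
with standard best practices. Output the skill as plain markdown."""


def refine_skill(task: str, retrieved: list[Skill], llm: Callable[[str], str]) -> Skill:
    """Synthesize a task-specific skill from the retrieved candidates (sketch only)."""
    skills_text = "\n\n".join(f"### {s.name}\n{s.body}" for s in retrieved)
    refined_body = llm(REFINE_PROMPT.format(task=task, skills=skills_text))
    return Skill(name="refined-task-skill", description=task, body=refined_body)
```

Here `llm` is any text-in, text-out completion function; keeping it as a callable avoids committing the sketch to a particular model API.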