SkillRouter: Skill Routing for LLM Agents at Scale

YanZhao Zheng; ZhenTao Zhang; Chao Ma; YuanQiang Yu; JiHuai Zhu; Yong Wu; Tianze Xu; Baohua Dong; Hangcheng Zhu; Ruohui Huang; Gang Yu

Abstract

Reusable skills let LLM agents package task-specific procedures, tool affordances, and execution guidance into modular building blocks. As skill ecosystems grow to tens of thousands of entries, exposing every skill at inference time becomes infeasible. This creates a skill-routing problem: given a user task, the system must identify relevant skills before downstream planning or execution. Existing agent stacks often rely on progressive disclosure, exposing only skill names and descriptions while hiding the full implementation body. We examine this design choice on a SkillsBench-derived benchmark with approximately 80K candidate skills, targeting the practically important setting of large skill registries with heavy overlap. Across representative sparse, dense, and reranking baselines on this setting, hiding the skill body causes a 31--44 percentage point drop in routing accuracy, showing that full skill text is a critical routing signal in this setting rather than a minor metadata refinement. Motivated by this finding, we present SkillRouter, a compact 1.2B full-text retrieve-and-rerank pipeline. SkillRouter achieves 74.0% Hit@1 on our benchmark -- the strongest average top-1 routing performance among the baselines we evaluate -- while using 13$\times$ fewer parameters and running 5.8$\times$ faster than the strongest base pipeline. The ranking gains further generalize to a supplementary benchmark independently constructed from three skill sources. In a complementary end-to-end study across four coding agents, routing gains transfer to improved task success, with larger gains for more capable agents.
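
The Hit@1 figure reported above can be made concrete with a small sketch. It assumes Hit@K counts a query as a hit when at least one of its gold skills appears among the top K ranked candidates; the paper's exact definition may differ. The function name, skill identifiers, and toy data are hypothetical.

```python
from typing import Dict, List, Set


def hit_at_k(rankings: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int = 1) -> float:
    """Fraction of queries whose top-k ranked skills include a gold skill.

    Assumed definition for illustration; not taken from the paper.
    """
    if not rankings:
        return 0.0
    hits = sum(
        1 for query_id, ranked in rankings.items()
        if gold.get(query_id, set()) & set(ranked[:k])
    )
    return hits / len(rankings)


# Toy rankings over hypothetical skill ids, best candidate first.
rankings = {
    "q1": ["pdf-extract", "ocr-scan", "csv-clean"],
    "q2": ["csv-clean", "sql-query", "pdf-extract"],
}
gold = {"q1": {"pdf-extract"}, "q2": {"sql-query"}}

print(hit_at_k(rankings, gold, k=1))  # 0.5: only q1 has a gold skill at rank 1
print(hit_at_k(rankings, gold, k=2))  # 1.0: q2's gold skill sits at rank 2
```

Under this reading, a 74.0% Hit@1 would mean that the top-ranked skill is a correct one for roughly three of every four queries.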

Paper Structure (89 sections, 5 equations, 4 figures, 28 tables)


Introduction
Problem definition and benchmark
Task and metrics.
Benchmark construction.
Benchmark credibility and scope.
What signals drive skill selection?
Body removal collapses performance across method families.
Length-controlled attention supports the same story.
Implication.
SkillRouter: a compact full-text routing recipe
Bi-encoder retrieval.
Hard negative mining.
False negative filtering.
Cross-encoder reranking.
Implementation details.
...and 74 more sections

Figures (4)

Figure 1: Full skill text is a critical routing signal. Left: Averaged over the paper's Easy and Hard tiers, removing body reduces Hit@1 by 31.4pp for BM25, 38.7pp for Qwen3-Emb-8B, and 44.0pp for Qwen3-Emb-8B $\times$ Qwen3-Rank-8B. Right: Length-controlled attention diagnostics argue against a simple length-only explanation: although the body field occupies 96.5% of skill tokens, the short name field peaks at 26.3% attention in layer 19 despite covering only 3.0% of tokens, while the final layer returns to 98.1% body attention.
Figure 2: SkillRouter pipeline. A bi-encoder retrieves top-20 candidates from the full ${\sim}$80K pool; a cross-encoder reranks them. Both stages use full skill text, motivated by the body-access finding in Section \ref{sec:body_study}.
Figure 3: Length-controlled attention visualization for SR-Rank-0.6B on 75 query-skill pairs. Left: per-layer mean attention trajectories compared against each field's token-share baseline; shaded bands denote $\pm 1$ standard deviation across the 75 query-skill pairs. The short name field spikes far above its 3.0% token-share baseline in the middle layers, while the final layer returns to body. Right: query-level final-layer body attention compared against each query's body-token baseline. Most points lie above the diagonal, meaning the final layer attends to body more than a pure length-based baseline would predict.
Figure 4: Recall@$K$ candidate coverage for three encoder retrievers on Easy and Hard. The star marker at $K{=}20$ indicates the primary SR-Emb-0.6B $\times$ SR-Rank-0.6B pipeline's end-to-end Hit@1 at the main operating point, and is shown only as a reference against the Recall@$K$ curves.
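
The two-stage flow in the Figure 2 caption (retrieve a short candidate list, then rerank it, with both stages reading full skill text) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words cosine stands in for the bi-encoder, the token-overlap scorer stands in for the cross-encoder, and all function names, skill ids, and skill texts are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def bow(text: str) -> Counter:
    """Bag-of-words vector; a toy stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, skills: Dict[str, str], top_k: int = 20) -> List[str]:
    """Stage 1: score every skill independently of the others, keep the top_k."""
    q = bow(query)
    scored = sorted(skills, key=lambda sid: cosine(q, bow(skills[sid])), reverse=True)
    return scored[:top_k]


def toy_pair_score(query: str, skill_text: str) -> float:
    """Stage 2 stand-in: share of query tokens found in the full skill text."""
    q, s = set(query.lower().split()), set(skill_text.lower().split())
    return len(q & s) / len(q) if q else 0.0


def route(query: str, skills: Dict[str, str], top_k: int = 20,
          pair_score: Callable[[str, str], float] = toy_pair_score) -> List[str]:
    """Retrieve top_k candidates, then reorder them with the pairwise scorer."""
    candidates = retrieve(query, skills, top_k)
    return sorted(candidates, key=lambda sid: pair_score(query, skills[sid]), reverse=True)


# Each value is the full skill text: name, description, and body concatenated.
skills = {
    "pdf-extract": "pdf-extract extract text from pdf files body: open the pdf and read each page to collect tables",
    "pdf-merge": "pdf-merge merge pdf files body: concatenate pages from several pdf files into one",
    "csv-clean": "csv-clean clean csv data body: drop empty rows and normalise column names",
}

print(route("read tables from a pdf", skills, top_k=2))
```

In this toy registry the query terms "read" and "tables" occur only in the body of the first skill, which mirrors, at a very small scale, why hiding the body can remove the signal that separates overlapping skills.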
