SkewRoute: Training-Free LLM Routing for Knowledge Graph Retrieval-Augmented Generation via Score Skewness of Retrieved Context
Hairu Wang, Yuan Feng, Yukun Cao, Xike Xie, S Kevin Zhou
TL;DR
This work tackles the high inference cost of KG-RAG by introducing SkewRoute, a training-free LLM routing framework that uses the skewness of retrieved context scores to gauge query difficulty and allocate work between smaller and larger LLMs. By routing simple queries to smaller models and complex ones to larger models based on score distributions, SkewRoute achieves strong cost-performance trade-offs, with substantial improvements over baselines across multiple model sizes and families on the WebQSP and CWQ datasets. The method is plug-and-play, CPU-friendly, and generalizes beyond a single scorer, making it practical for real-world KG-RAG deployments. The results show significant reductions in large-LLM usage (e.g., up to 8x gains over RouteLLM on WebQSP) while maintaining or improving Hit@1 performance, highlighting the potential of distribution-skew-aware routing to enable cost-effective, scalable KG-RAG systems.
Abstract
Large language models excel at many tasks but often incur high inference costs during deployment. To mitigate hallucination, many systems use a knowledge graph to enhance retrieval-augmented generation (KG-RAG). However, the large amount of retrieved knowledge context increases these inference costs further. A promising way to balance performance and cost is LLM routing, which directs simple queries to smaller LLMs and complex ones to larger LLMs. Yet no dedicated routing methods currently exist for RAG, and existing training-based routers struggle to scale to this domain because they need extensive training data. We observe that the score distributions produced by the retrieval scorer strongly correlate with query difficulty. Based on this, we propose an extremely simple yet effective routing framework, the first specifically designed for KG-RAG, that efficiently balances performance and cost in a plug-and-play manner. It delivers over 3x higher routing effectiveness while reducing runtime to less than 0.001x that of existing methods. Our code is available at https://github.com/hrwang00/SkewRoute.
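The routing idea can be illustrated with a minimal sketch. It is not the released implementation: the function names, the threshold value, and the routing direction (a highly right-skewed score distribution is treated as an easy query and sent to the smaller LLM) are assumptions made for illustration. Skewness is computed here as the standard moment coefficient g1 = m3 / m2^1.5 using only the Python standard library, which keeps the router training-free and CPU-only.

```python
from typing import Sequence


def score_skewness(scores: Sequence[float]) -> float:
    """Moment coefficient of skewness g1 = m3 / m2**1.5 (population form).

    Returns 0.0 for fewer than three scores or for constant scores,
    where skewness is undefined or uninformative.
    """
    n = len(scores)
    if n < 3:
        return 0.0
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n  # second central moment
    m3 = sum((s - mean) ** 3 for s in scores) / n  # third central moment
    if m2 == 0.0:
        return 0.0
    return m3 / m2 ** 1.5


def route(scores: Sequence[float], threshold: float = 1.0) -> str:
    """Pick a model tier from the retrieval scores of one query.

    Assumption: a few contexts scoring far above the rest (high positive
    skewness) indicates an easy query. The threshold is hypothetical and
    would be tuned to the desired cost budget.
    """
    return "small" if score_skewness(scores) >= threshold else "large"


# One dominant context: strongly right-skewed scores.
peaked = [0.95, 0.05, 0.04, 0.03, 0.02, 0.02, 0.01, 0.01]
# No context stands out: nearly symmetric scores.
flat = [0.30, 0.28, 0.27, 0.25, 0.24, 0.22, 0.20, 0.19]

decisions = {"peaked": route(peaked), "flat": route(flat)}
```

Raising the threshold sends more queries to the larger LLM, so sweeping it is one way to trace a cost-performance curve under these assumptions.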
