SkewRoute: Training-Free LLM Routing for Knowledge Graph Retrieval-Augmented Generation via Score Skewness of Retrieved Context

Hairu Wang, Yuan Feng, Yukun Cao, Xike Xie, S Kevin Zhou

TL;DR

This work tackles the high inference cost of KG-RAG by introducing SkewRoute, a training-free LLM routing framework that uses the skewness of retrieved context scores to gauge query difficulty and allocate queries between smaller and larger LLMs. By routing simple queries to smaller models and complex ones to larger models based on score distributions, SkewRoute achieves strong cost-performance trade-offs, with substantial improvements over baselines across multiple model sizes and families on the WebQSP and CWQ datasets. The method is plug-and-play, CPU-friendly, and generalizes beyond a single scorer, making it practical for real-world KG-RAG deployments. The results demonstrate significant reductions in large-LLM usage (e.g., up to 8x gains over RouteLLM on WebQSP) while maintaining or improving Hit@1 performance, highlighting the potential of distribution-skew-aware routing to enable cost-effective, scalable KG-RAG systems.

Abstract

Large language models excel at many tasks but often incur high inference costs during deployment. To mitigate hallucination, many systems use a knowledge graph to enhance retrieval-augmented generation (KG-RAG). However, the large volume of retrieved knowledge context increases these inference costs further. A promising solution to balance performance and cost is LLM routing, which directs simple queries to smaller LLMs and complex ones to larger LLMs. However, no dedicated routing methods currently exist for RAG, and existing training-based routers face challenges scaling to this domain due to the need for extensive training data. We observe that the score distributions produced by the retrieval scorer strongly correlate with query difficulty. Based on this, we propose an extremely simple yet effective routing framework, the first specifically designed for KG-RAG, that efficiently balances performance and cost in a plug-and-play manner. It delivers over 3x higher routing effectiveness while reducing runtime to less than 0.001x that of existing methods. Our code is available at https://github.com/hrwang00/SkewRoute.
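The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only assumes that a strongly right-skewed score distribution (a few contexts scoring far above the rest) signals an easier query. The function names, the use of the Fisher-Pearson moment coefficient as the skewness statistic, and the threshold value of 1.0 are all hypothetical choices made for illustration.

```python
from statistics import fmean


def score_skewness(scores):
    """Fisher-Pearson moment coefficient of skewness: m3 / m2**1.5.

    Assumption: this statistic stands in for whatever skewness
    measure the paper actually uses.
    """
    if len(scores) < 3:
        return 0.0  # too few scores to say anything about shape
    mean = fmean(scores)
    m2 = fmean([(s - mean) ** 2 for s in scores])  # second central moment
    m3 = fmean([(s - mean) ** 3 for s in scores])  # third central moment
    if m2 == 0.0:
        return 0.0  # all scores identical: no skew
    return m3 / m2 ** 1.5


def route(scores, threshold=1.0):
    """Return "small" or "large" for one query.

    Assumption: skewness above the (hypothetical) threshold means the
    retriever singled out a few dominant contexts, so the query is
    treated as easy and sent to the smaller LLM.
    """
    return "small" if score_skewness(scores) > threshold else "large"


# Hypothetical retrieval scores for two queries.
peaked_scores = [0.9, 0.05, 0.02, 0.01, 0.01, 0.01]  # one dominant context
flat_scores = [0.2, 0.19, 0.18, 0.17, 0.16, 0.1]     # no clear winner

print(route(peaked_scores))
print(route(flat_scores))
```

Because the decision needs only the scores the retriever already produced, a router of this shape requires no training data and runs on CPU; in practice the threshold would be tuned to hit a target share of large-LLM calls.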

Paper Structure

This paper contains 27 sections, 1 equation, 10 figures, 12 tables, 1 algorithm.

Figures (10)

  • Figure 1: A Training-Free Routing Framework for LLMs in KG-RAG. Scores of retrieved contexts, sorted in descending order, exhibit a distinct skewness pattern. The framework uses the score skewness of retrieved contexts to route queries, requiring no training.
  • Figure 2: Token and Performance-Cost Statistics on CWQ. (a) illustrates how the number of input tokens varies with the number of retrieved contexts in KG-RAG. (b) presents the inference cost and performance of LLMs of different scales on an LLM cloud service platform.
  • Figure 3: Scores of Retrieved Contexts in CWQ. (a) and (b) are plotted in linear coordinates, while (c) and (d) use a log-log scale.
  • Figure 4: Query Difficulty Across Score Skewness.
  • Figure 5: Routing Between Multiple Models.
  • ...and 5 more figures