Routesplain: Towards Faithful and Intervenable Routing for Software-related Tasks

Adam Štorek, Vikas Upadhyay, Marianne Menglin Liu, Daniel W. Peterson, Anshul Mittal, Sujeeth Bharadwaj, Fahad Shah, Dan Roth

TL;DR

Routesplain is a concept-based LLM router for software-related tasks. It formulates routing as a two-stage process: a concept classifier $h: \mathbb{R}^d \to \mathbb{R}^k$ maps a query embedding to concepts, and a model classifier $g: \mathbb{R}^k \to \mathbb{R}^n$ maps concepts to model scores, so the full router is $f = g \circ h$. For the $i$-th training query with embedding $\mathbf{x}^i$, concept targets $\mathbf{c}^i$, and model targets $\mathbf{m}^i$, training fits $\hat{h}$ with $L_{BCE}(h(\mathbf{x}^i), \mathbf{c}^i)$ and $\hat{g}$ with $L_{BCE}(g(\mathbf{c}^i), \mathbf{m}^i) + \lambda \cdot \mathrm{cost}(\mathbf{x}^i)$. This enables cost-aware routing and lets humans edit concepts at inference time. The evaluation spans eight software-related tasks and 16 LLMs, using 38,685 examples across diverse datasets. Routesplain achieves competitive accuracy and Pareto-optimal cost-accuracy tradeoffs while providing faithful, interpretable rationales. Ablation and counterfactual concept-manipulation experiments show that predicting query complexity is the principal bottleneck, which points to where the router should be improved; concept-level interventions shift routing decisions predictably, validating controllability. Overall, Routesplain shows that interpretable, domain-specific routing can match or exceed black-box baselines while reducing cost and offering diagnostic insight for future router improvements.
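The two-stage formulation $f = g \circ h$ and the concept-level intervention described above can be illustrated with a minimal sketch. It is not the authors' implementation: the single linear-plus-sigmoid layers, the random weights, the dimensions, and the names `make_layer`, `linear_sigmoid`, and `route` are all assumptions made for illustration, and the cost term $\lambda \cdot \mathrm{cost}$ is left out because it only affects training.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def make_layer(n_out, n_in, rng):
    """Hypothetical layer: small random weights and zero biases."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear_sigmoid(weights, bias, inputs):
    """Apply one linear layer followed by an element-wise sigmoid."""
    return [
        sigmoid(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def route(x, h_layer, g_layer, overrides=None):
    """Route an embedding x through h (concepts), then g (model scores).

    `overrides` maps a concept index to a value set by a human; it replaces
    the predicted concept before g is applied, so the routing decision
    depends only on the (possibly edited) concept vector.
    """
    concepts = linear_sigmoid(*h_layer, x)         # h: R^d -> R^k
    for index, value in (overrides or {}).items():
        concepts[index] = value                    # concept-level intervention
    scores = linear_sigmoid(*g_layer, concepts)    # g: R^k -> R^n
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, concepts, scores


rng = random.Random(0)
h_layer = make_layer(4, 8, rng)   # assumed sizes: d = 8, k = 4
g_layer = make_layer(3, 4, rng)   # assumed sizes: k = 4, n = 3
x = [1.0] * 8                     # stand-in for a real query embedding

best, concepts, scores = route(x, h_layer, g_layer)
edited_best, edited_concepts, edited_scores = route(
    x, h_layer, g_layer, overrides={0: 1.0}
)
```

Because `g` only ever sees the concept vector, any change in the routed model after an override is attributable to the edited concept, which is the sense in which such a router's rationale is faithful.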

Abstract

LLMs now tackle a wide range of software-related tasks, yet we show that their performance varies markedly both across and within these tasks. Routing user queries to the appropriate LLMs can therefore help improve response quality while reducing cost. Prior work, however, has focused mainly on general-purpose LLM routing via black-box models. We introduce Routesplain, the first LLM router for software-related tasks, including multilingual code generation and repair, input/output prediction, and computer science QA. Unlike existing routing approaches, Routesplain first extracts human-interpretable concepts from each query (e.g., task, domain, reasoning complexity) and only routes based on these concepts, thereby providing intelligible, faithful rationales. We evaluate Routesplain on 16 state-of-the-art LLMs across eight software-related tasks; Routesplain outperforms individual models both in terms of accuracy and cost, and equals or surpasses all black-box baselines, with concept-level intervention highlighting avenues for further router improvements.

Paper Structure

This paper contains 24 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of the Routesplain framework. Each query is embedded using a fixed embedding model. First, a concept classifier takes the contextualized embedding and projects it into the concept space, where each concept represents a high-level, interpretable characteristic of the query (e.g., task, domain, programming language). Then, a separate model classifier takes a concept-space input and outputs model suitability predictions. The query is routed to the most suitable model.
  • Figure 2: Intertask Performance. Left: 0-shot pass@1 accuracy of each LLM on each task. Right: Share of queries each LLM answered most cost-effectively per task. An LLM provides the optimal answer if it correctly answers a query at the lowest cost among all LLMs that answered correctly. For example, o3 provided the most cost-effective correct answer for 10.11% of BCB-Repair queries.
  • Figure 3: 0-shot pass@1 intratask accuracy. Left: Computer-science-related QA, stratified by the natural language of the query. Right: Multi-programming-language code completion, stratified by the programming language of the query. For radar plots, we select six representative models.
  • Figure 4: 0-shot pass@1 intratask accuracy comparison across open-domain code generation and repair tasks, stratified by the domain of the query, for six representative models.
  • Figure 5: Left: Accuracy against cost of each individual model as well as all routers on the test set. For EmbedLLM, MLP router, and Routesplain, we show accuracy and cost averaged across 5 training runs. For Routesplain and MLP router, we display the Pareto frontier established by connecting the average router performance across multiple values of the cost regularizer $\lambda$. Right: The average number of times each model was assigned to an input by Routesplain as the cost regularizer $\lambda$ increases. We increase $\lambda$ in steps of $0.1$ on the interval $[0; 1)$ and then in steps of $1$ on the interval $[1; 10]$. We display only the 8 most frequently assigned models.
  • ...and 2 more figures
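
Two of the captions above describe procedures precise enough to sketch: the "most cost-effective correct answer" share from Figure 2 and the $\lambda$ sweep from Figure 5. The following is an illustrative sketch only. The function names, the model names `model_a` and `model_b`, the per-query costs, the tie-breaking rule, and the choice to divide by all queries (including those no model answers correctly) are assumptions, not details taken from the paper.

```python
from collections import Counter


def cheapest_correct_model(results):
    """Return the model with the lowest cost among those that answered correctly.

    `results` maps a model name to (is_correct, cost). Returns None when no
    model answered correctly. Ties go to the first model in dict order
    (an assumption; the caption does not say how ties are broken).
    """
    correct_costs = {m: cost for m, (ok, cost) in results.items() if ok}
    if not correct_costs:
        return None
    return min(correct_costs, key=correct_costs.get)


def optimal_share(queries):
    """Fraction of all queries for which each model is the cheapest correct one."""
    winners = Counter(cheapest_correct_model(q) for q in queries)
    winners.pop(None, None)  # drop queries that no model answered correctly
    return {model: count / len(queries) for model, count in winners.items()}


def lambda_grid():
    """Steps of 0.1 on [0, 1), then steps of 1 on [1, 10]."""
    fine = [i / 10 for i in range(10)]          # 0.0, 0.1, ..., 0.9
    coarse = [float(i) for i in range(1, 11)]   # 1.0, 2.0, ..., 10.0
    return fine + coarse


# Hypothetical per-query results: model -> (answered correctly?, cost).
queries = [
    {"model_a": (True, 0.5), "model_b": (True, 0.1)},    # model_b is cheaper
    {"model_a": (True, 0.5), "model_b": (False, 0.1)},   # only model_a correct
    {"model_a": (False, 0.5), "model_b": (False, 0.1)},  # nobody correct
    {"model_a": (True, 0.2), "model_b": (True, 0.3)},    # model_a is cheaper
]
shares = optimal_share(queries)
grid = lambda_grid()
```

With these four made-up queries, `model_a` is the cheapest correct model for two of them and `model_b` for one, and the grid contains 20 values of $\lambda$.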