Tryage: Real-time, intelligent Routing of User Prompts to Large Language Models

Surya Narayanan Hari; Matt Thomson

TL;DR

The paper introduces Tryage, a perceptive routing system that dynamically selects among a library of language models based on prompt analysis and user constraints, reducing the burden of manual model selection. It formulates routing as a predictive, RL-inspired decision problem and supports end-to-end training between the router and downstream experts, enabling Pareto-front exploration of accuracy versus size/recency/security. Across MLM tasks on the Pile, Tryage outperforms Gorilla and GPT-3.5 Turbo in model selection accuracy and demonstrates domain-aware routing and interpretable latent representations. This work highlights a scalable paradigm for orchestrating multiple LLMs to maximize efficiency and task-specific performance in evolving model ecosystems.

Abstract

The introduction of the transformer architecture and the self-attention mechanism has led to an explosive production of language models trained on specific downstream tasks and data domains. With over 200,000 models in the Hugging Face ecosystem, users grapple with selecting and optimizing models to suit multifaceted workflows and data domains while addressing computational, security, and recency concerns. There is an urgent need for machine learning frameworks that can eliminate the burden of model selection and customization and unleash the power of the vast emerging model library for end users. Here, we propose a context-aware routing system, Tryage, that leverages a language model router for optimal selection of expert models from a model library based on analysis of individual input prompts. Inspired by the thalamic router in the brain, Tryage employs a perceptive router to predict downstream model performance on prompts and then makes a routing decision using an objective function that integrates the performance predictions with user goals and constraints, which are supplied through flags (e.g., model size, model recency). Tryage allows users to explore a Pareto front and automatically trade off task accuracy against secondary goals, including minimizing model size and accounting for recency, security, verbosity, and readability. Across heterogeneous data sets that include code, text, clinical data, and patents, the Tryage framework surpasses Gorilla and GPT-3.5 Turbo in dynamic model selection, identifying the optimal model with an accuracy of 50.9%, compared to 23.6% by GPT-3.5 Turbo and 10.8% by Gorilla. Conceptually, Tryage demonstrates how routing models can be applied to program and control the behavior of multi-model LLM systems to maximize efficient use of the expanding and evolving language model ecosystem.
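
As a reading aid, the routing decision described in the abstract can be written as a single minimization. This is a sketch under assumed notation, not the paper's own formula: $x$ is the prompt, $\mathcal{M}$ the model library, $\hat{L}(x, m)$ the router's predicted loss of model $m$ on $x$, $C_j(m)$ a user-flagged constraint cost (for example model size or recency), and $\lambda_j \ge 0$ its weight.

$$\hat{m}(x) = \arg\min_{m \in \mathcal{M}} \Big[ \hat{L}(x, m) + \sum_{j} \lambda_j \, C_j(m) \Big]$$

With every $\lambda_j = 0$ the router simply picks the model with the lowest predicted loss. Increasing a weight shifts the choice toward models that are cheaper under that constraint.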

Paper Structure (1 section, 5 equations, 5 figures)

Introduction

Figures (5)

Figure 1: In the Tryage system, a prompt and flag are provided to the system, and it finds the best model to perform the MLM task given the flag and the prompt.
Figure 2: Picking a model for a task is challenging because multiple models display differential performance on different datasets.
Figure 3: a) Tryage outperforms existing SoTA models on masked language modeling (MLM), including SoTA generative LLMs such as GPT-3.5 Turbo and Gorilla. b) Examining which models Tryage picked for incoming queries from each domain makes it easier to build a super-specialized system that still generalizes with high performance across multiple specializations. It also helps build trust and transparency, since one can use Tryage to filter for domain knowledge specific to a given dataset. c) Performance of models on the MLM task across all subsets of the Pile shows that Tryage outperforms other models trained on MLM. d) Same as c, but averaged across datasets by model; shows gains by Tryage, as well as the performance of Gorilla without a human in the loop.
Figure 4: UMAP projections of the latent-space embeddings of incoming queries. In the Tryage router (b), embeddings from the same domain cluster together. In a general model such as GPT-2 (a), points from the same domain do not cluster.
Figure 5: a) By trading off performance against effective model size, we can form a Pareto curve along which users can choose their preferred balance of performance and latency. b, c, d) As $\lambda$ increases, Tryage's allocations shift from predominantly large models (b), to a mixture of small, medium, and large models (c), to predominantly smaller models (d). The system thus queries large models only when it 'needs' to, effectively trading cost against performance.
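
The allocation shift described for Figure 5 can be illustrated with a small sketch. It assumes the router scores each model as predicted loss plus $\lambda$ times a size penalty and picks the minimum. The function name `route`, the loss matrix, and the size penalties are hypothetical values chosen for illustration, not numbers from the paper.

```python
import numpy as np

def route(pred_loss, size_penalty, lam):
    """Return, per prompt, the index of the model minimizing
    predicted loss + lam * size penalty (assumed objective)."""
    scores = pred_loss + lam * size_penalty[None, :]
    return scores.argmin(axis=1)

# Hypothetical router predictions: rows are prompts,
# columns are models ordered (small, medium, large).
pred_loss = np.array([
    [2.0, 1.5, 1.0],
    [1.2, 1.1, 1.0],
    [3.0, 2.0, 1.9],
])
# Hypothetical normalized model sizes used as the penalty term.
size_penalty = np.array([0.1, 0.5, 1.0])

for lam in (0.0, 0.5, 5.0):
    choice = route(pred_loss, size_penalty, lam)
    mean_loss = pred_loss[np.arange(len(choice)), choice].mean()
    mean_size = size_penalty[choice].mean()
    print(f"lambda={lam}: models={choice.tolist()}, "
          f"mean loss={mean_loss:.2f}, mean size={mean_size:.2f}")
```

In this toy setup, $\lambda = 0$ sends every prompt to the large model, $\lambda = 0.5$ yields a mixture, and $\lambda = 5$ sends every prompt to the small model. Plotting mean loss against mean size over a sweep of $\lambda$ would trace a Pareto-style curve for these made-up numbers.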
