LLMBridge: Reducing Costs to Access LLMs in a Prompt-Centric Internet

Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, Fahad Dogar

TL;DR

This paper addresses the rising cost of accessing large language models by proposing a prompt-centric proxy, LLMBridge, that enables cost-aware decisions through three core components: model selection, context management, and semantic caching. It defines a high-level, bidirectional API that allows applications to delegate low-level choices to the proxy while maintaining transparency and enabling iterative refinement. The authors implement LLMBridge in a serverless AWS deployment and demonstrate its practicality through real-world deployments (WhatsApp Q&A service) and educational classroom use, reporting meaningful cost reductions and competitive quality. Microbenchmarks and case studies show that intelligent model routing, selective context, and cached data can substantially cut costs (e.g., up to ~40% via model selection, 30–50% via smart context) while maintaining acceptable latency and response quality, underscoring the approach’s potential for cost-sensitive environments. Overall, LLMBridge offers a scalable, flexible framework to operationalize cost-efficient prompt-based AI services in both developing-region deployments and educational settings, with a clear path toward broader adoption and further optimization.

Abstract

Today's Internet infrastructure is centered around content retrieval over HTTP, with middleboxes (e.g., HTTP proxies) playing a crucial role in performance, security, and cost-effectiveness. We envision a future where Internet communication will be dominated by "prompts" sent to generative AI models. For this, we will need proxies that provide similar functions to HTTP proxies (e.g., caching, routing, compression) while dealing with unique challenges and opportunities of prompt-based communication. As a first step toward supporting prompt-based communication, we present LLMBridge, an LLM proxy designed for cost-conscious users, such as those in developing regions and education (e.g., students, instructors). LLMBridge supports three key optimizations: model selection (routing prompts to the most suitable model), context management (intelligently reducing the amount of context), and semantic caching (serving prompts using local models and vector databases). These optimizations introduce trade-offs between cost and quality, which applications navigate through a high-level, bidirectional interface. As case studies, we deploy LLMBridge in two cost-sensitive settings: a WhatsApp-based Q&A service and a university classroom environment. The WhatsApp service has been live for over twelve months, serving 100+ users and handling more than 14.7K requests. In parallel, we exposed LLMBridge to students across three computer science courses over a semester, where it supported diverse LLM-powered applications - such as reasoning agents and chatbots - and handled an average of 500 requests per day. We report on deployment experiences across both settings and use the collected workloads to benchmark the effectiveness of various cost-optimization strategies, analyzing their trade-offs in cost, latency, and response quality.
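To make the three optimizations concrete, the following is a minimal sketch of how a proxy might chain them: check a semantic cache, otherwise trim the context to the last k messages and route the prompt to a model. This is an illustration under stated assumptions, not the authors' implementation. The names `toy_embed`, `SemanticCache`, `trim_context`, `select_model`, and `handle`, the 0.95 similarity threshold, the length-based routing rule, and the model labels are all hypothetical; a real deployment would use an embedding model and a vector database in place of the letter-count embedding.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: per-letter counts."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


@dataclass
class SemanticCache:
    threshold: float = 0.95  # assumed similarity cutoff, not taken from the paper
    entries: list = field(default_factory=list)  # (embedding, response) pairs

    def lookup(self, prompt):
        """Return the cached response of the most similar prompt, or None."""
        query = toy_embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, prompt, response):
        self.entries.append((toy_embed(prompt), response))


def trim_context(history, k):
    """Context management: keep only the k most recent messages."""
    return history[-k:] if k > 0 else []


def select_model(prompt, long_prompt_chars=200):
    """Hypothetical routing rule: short prompts go to the cheaper model."""
    return "cheap-model" if len(prompt) < long_prompt_chars else "strong-model"


def handle(prompt, history, cache, call_model, k=5):
    """Serve from the cache if possible; otherwise trim, route, and call."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached, "cache"
    model = select_model(prompt)
    response = call_model(model, trim_context(history, k) + [prompt])
    cache.store(prompt, response)
    return response, model
```

Here `call_model` is any caller-supplied function taking a model name and a message list. The second element of the returned tuple reports whether the answer came from the cache or from a model, loosely mirroring the idea of an interface that tells the application which choice the proxy made.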

Paper Structure

This paper contains 33 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: (a) Compares the cost, measured in input tokens, when varying numbers of previous messages ($k$) are included in the context. (b) Compares the quality of each strategy, with $k=50$ as the reference.
  • Figure 2: Overview of LLMBridge design.
  • Figure 3: The WhatsApp Q&A service. Buttons 1-3 have pre-fetched (and cached) responses which are returned when a user interacts with them to avoid delays and keep the conversation responsive.
  • Figure 4: (a) Compares the quality of the verification strategy with $t=8$ and random strategies with $p=0.64$ and $p=0.1$ using an earlier generation of models (GPT-3.5, GPT-4, Opus). (b) Shows the same comparison with newer models (GPT-4o-mini, GPT-4o).
  • Figure 5: (a) Compares the cost of answering all prompts using our verification strategy with $t=8$ and our random strategy with $p=0.64$. (b) Compares the total time. Both are normalized to GPT-3.5.
  • ...and 2 more figures