Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

Bharadwaj Yadavalli

TL;DR

Dynamic Template Selection (DTS) addresses the cost inefficiency of uniform prompting in LLM deployments by routing queries to templates of appropriate verbosity. The authors compare an MLP router based on pre-computed embeddings to a fine-tuned RoBERTa transformer, finding that the lightweight MLP achieves 90.5% routing accuracy, comparable to the transformer's 89.5%, with about 125M fewer parameters. DTS demonstrates provider-agnostic generalization across OpenAI, Gemini, and Claude, delivering roughly 33% reductions in output tokens, the more expensive token type, across production API calls. The work provides a formal problem formulation, four algorithms with complexity analyses, and extensive empirical validation, illustrating significant practical cost savings without compromising output quality. Overall, DTS offers a scalable, architecture-agnostic strategy for production LLM cost optimization through adaptive, query-aware prompting.

Abstract

Contemporary large language model deployments typically employ uniform prompting strategies across diverse query types, applying verbose response patterns to both complex analytical tasks and straightforward factual questions. This one-size-fits-all methodology leads to substantial token inefficiency, a concern amplified by the significant cost differential between input and output tokens, with the latter commanding 4-8x higher prices across major providers. We present Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity, achieving significant cost reductions without compromising response quality. We compare two routing approaches: a simple MLP that uses pre-computed embeddings and a more complex fine-tuned RoBERTa transformer. Through comprehensive evaluation on 1,000 MMLU questions, we find that the MLP router achieves 90.5% routing accuracy on held-out test data, marginally exceeding RoBERTa's performance (89.5%) despite utilizing 125M fewer parameters. Notably, our empirical analysis reveals provider-agnostic behavior in template selection: routing decisions generalize effectively across three major LLM providers (OpenAI GPT-4, Google Gemini, and Anthropic Claude), as validated through 9,000 production API calls. While routing accuracy remains consistent at 90.5% across providers, observed token reductions vary from 32.6% to 33.9%, reflecting provider-specific generation characteristics. This work contributes several key elements: a formal problem formulation with theoretical grounding in machine learning, four algorithms with corresponding complexity analyses, and extensive empirical validation across production systems.
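
To make the routing idea concrete, the following is a minimal sketch of an MLP router over a pre-computed query embedding. It is not the authors' implementation: the template names come from Figure 2, while the layer sizes, the random weights (standing in for trained parameters), and the helper names `make_layer`, `dense`, and `route` are assumptions made for illustration.

```python
import math
import random

# Template names taken from Figure 2; everything else here is illustrative.
TEMPLATES = ["minimal", "standard", "verbose", "technical", "executive"]


def make_layer(n_in, n_out, rng):
    """Randomly initialised dense layer (a stand-in for trained weights)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(embedding, hidden_layer, output_layer):
    """Return (template, probabilities) for one pre-computed query embedding."""
    probs = softmax(dense(relu(dense(embedding, hidden_layer)), output_layer))
    best = max(range(len(TEMPLATES)), key=probs.__getitem__)
    return TEMPLATES[best], probs


rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM = 16, 8  # hypothetical sizes, not taken from the paper
hidden_layer = make_layer(EMBED_DIM, HIDDEN_DIM, rng)
output_layer = make_layer(HIDDEN_DIM, len(TEMPLATES), rng)

# Placeholder vector; in practice this would come from an embedding model.
query_embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
template, probs = route(query_embedding, hidden_layer, output_layer)
print(template, [round(p, 3) for p in probs])
```

Because the embedding is computed once and the router is a small feed-forward network, the routing step adds little overhead compared with running a full transformer classifier on every query.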

Paper Structure

This paper contains 49 sections, 2 theorems, 7 equations, 4 figures, 6 tables, 4 algorithms.

Key Result

Theorem 1

(Tishby et al., 2000) Let $X \in \mathbb{R}^d$ be the embedding representation of query $Q$, and $Y \in \mathcal{T}$ be the routing decision. The optimal routing function maximizes mutual information under an information bottleneck objective, where $\beta$ controls the tradeoff between compression and accuracy.
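
One common way to write an information bottleneck objective in this notation is sketched below. It is an assumption based on the standard formulation of Tishby et al., not necessarily the paper's exact equation; the routing function $f$ and its internal representation $f(X)$ are symbols introduced here for illustration.

```latex
f^{*} = \arg\max_{f} \; I\bigl(f(X);\, Y\bigr) \;-\; \beta \, I\bigl(f(X);\, X\bigr)
```

The first term rewards representations that are informative about the routing decision $Y$ (accuracy), and the second penalizes information retained about the raw embedding $X$ (compression), so a larger $\beta$ favors more compressed representations.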

Figures (4)

  • Figure 1: Token Savings by Provider. Using identical routing decisions, we observe different token savings across providers due to provider-specific response generation patterns. MLP-based DTS achieves 33.0% (OpenAI), 33.9% (Gemini), and 32.6% (Claude) token reduction compared to an always-verbose baseline.
  • Figure 2: Template Distribution Across Providers. Distribution of template selections by the MLP router across 1,000 MMLU test questions. The router demonstrates intelligent template selection, with 51.8% verbose, 28.5% standard, 10.4% executive, 7.4% minimal, and 1.9% technical templates, matching query complexity to appropriate response verbosity levels.
  • Figure 3: Accuracy by Template Type Across Providers. MMLU accuracy varies by template complexity, demonstrating the quality-cost tradeoff. Verbose templates achieve the highest accuracy (69-83%), while minimal templates show variable performance (51-79%), supporting DTS's core hypothesis that different queries benefit from different template complexities. Gemini achieves the highest accuracy across all template types.
  • Figure 4: Cost Comparison: DTS vs Always-Verbose Baseline. Output token costs for 1,000 MMLU questions across higher-tier LLM models. Scaled to 1 million queries (the costs shown multiplied by 1,000), DTS saves $1,646 (OpenAI GPT-4o), $1,678 (Gemini 2.5 Pro), and $2,225 (Claude Sonnet 4). Cost savings scale with output token pricing multipliers (4-8×) and provider-specific token generation patterns.
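
The relationship between the token reductions in Figure 1 and the dollar savings in Figure 4 can be sketched with simple arithmetic. The token-reduction percentages below are the ones reported in Figure 1; the baseline response length and the output-token price are hypothetical placeholders, so the printed amounts will not match Figure 4. The function names are likewise illustrative.

```python
def output_cost_usd(n_queries, avg_output_tokens, usd_per_million_tokens):
    """Output-token cost in USD for n_queries responses of a given average length."""
    return n_queries * avg_output_tokens * usd_per_million_tokens / 1_000_000


def dts_savings_usd(n_queries, baseline_avg_tokens, token_reduction, usd_per_million_tokens):
    """Savings of DTS over an always-verbose baseline for a fractional token reduction."""
    baseline = output_cost_usd(n_queries, baseline_avg_tokens, usd_per_million_tokens)
    with_dts = output_cost_usd(
        n_queries, baseline_avg_tokens * (1.0 - token_reduction), usd_per_million_tokens
    )
    return baseline - with_dts


# Token reductions reported in Figure 1.
TOKEN_REDUCTION = {"OpenAI": 0.330, "Gemini": 0.339, "Claude": 0.326}

# ASSUMPTIONS (not from the paper): verbose-baseline length and output price.
BASELINE_AVG_TOKENS = 500
USD_PER_MILLION_OUTPUT_TOKENS = 10.0

for provider, reduction in TOKEN_REDUCTION.items():
    per_1k = dts_savings_usd(
        1_000, BASELINE_AVG_TOKENS, reduction, USD_PER_MILLION_OUTPUT_TOKENS
    )
    per_1m = per_1k * 1_000  # same scaling as the Figure 4 caption
    print(f"{provider}: ${per_1k:.3f} per 1,000 queries, ${per_1m:,.0f} per 1M queries")
```

Since savings are linear in both the output price and the baseline response length, providers with similar token reductions can still show noticeably different dollar savings.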

Theorems & Definitions (5)

  • Remark 1
  • Definition 1: Dynamic Template Selection Problem
  • Theorem 1: Information Bottleneck for DTS
  • Theorem 2: Generalization Bound - Standard PAC Learning
  • Remark 2