Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

Bharadwaj Yadavalli

TL;DR

Dynamic Template Selection (DTS) addresses the cost inefficiency of uniform prompting in LLM deployments by routing each query to a response template of appropriate verbosity. The authors compare an MLP router that operates on pre-computed embeddings with a fine-tuned RoBERTa transformer and find that the lightweight MLP reaches 90.5% routing accuracy, marginally above RoBERTa's 89.5%, while using 125M fewer parameters. Routing decisions generalize across OpenAI, Gemini, and Claude, reducing output tokens (the more expensive token class) by 32.6% to 33.9% over 9,000 production API calls. The work provides a formal problem formulation, four algorithms with complexity analyses, and extensive empirical validation, and it reports significant cost savings without compromising output quality. Overall, DTS offers a scalable, provider-agnostic strategy for production LLM cost optimization through adaptive, query-aware prompting.
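To make the routing idea concrete, the following is a minimal sketch of how an embedding-based MLP router might map a query embedding to a template. It is not the authors' implementation: the template names, layer sizes, and random (untrained) weights are all assumptions chosen only to show the data flow from embedding to template choice.

```python
import math
import random

# Hypothetical template labels; the source does not name its templates.
TEMPLATES = ["minimal", "standard", "verbose"]


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Build an untrained two-layer MLP with small random weights (illustrative only)."""
    rng = random.Random(seed)

    def layer(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": layer(hidden_dim, in_dim),
        "b1": [0.0] * hidden_dim,
        "w2": layer(out_dim, hidden_dim),
        "b2": [0.0] * out_dim,
    }


def affine(weights, bias, x):
    """Compute W @ x + b for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def route(mlp, embedding):
    """Return (template_name, probabilities) for one pre-computed query embedding."""
    hidden = [max(0.0, h) for h in affine(mlp["w1"], mlp["b1"], embedding)]  # ReLU
    logits = affine(mlp["w2"], mlp["b2"], hidden)
    # Numerically stable softmax over template logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return TEMPLATES[best], probs


if __name__ == "__main__":
    mlp = make_mlp(in_dim=8, hidden_dim=16, out_dim=len(TEMPLATES))
    toy_embedding = [0.1] * 8  # stand-in for a real sentence embedding
    print(route(mlp, toy_embedding))
```

In a real deployment the weights would be trained on labeled query/template pairs and the embedding would come from a sentence-embedding model; the sketch only illustrates why such a router is cheap at inference time, since it amounts to two small matrix products per query.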

Abstract

Contemporary large language model deployments typically employ uniform prompting strategies across diverse query types, applying verbose response patterns to both complex analytical tasks and straightforward factual questions. This one-size-fits-all methodology leads to substantial token inefficiency, a concern amplified by the cost differential between input and output tokens: output tokens are priced 4-8x higher across major providers. We present Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity, achieving significant cost reductions without compromising response quality. We compare two routing approaches: a simple MLP that uses pre-computed embeddings and a more complex fine-tuned RoBERTa transformer. Through comprehensive evaluation on 1,000 MMLU questions, we find that the MLP router achieves 90.5% routing accuracy on held-out test data, marginally exceeding RoBERTa's performance (89.5%) despite using 125M fewer parameters. Notably, our empirical analysis reveals provider-agnostic behavior in template selection: routing decisions generalize effectively across three major LLM providers (OpenAI GPT-4, Google Gemini, and Anthropic Claude), as validated through 9,000 production API calls. While routing accuracy remains consistent at 90.5% across providers, observed token reductions vary from 32.6% to 33.9%, reflecting provider-specific generation characteristics. This work makes three contributions: a formal problem formulation with theoretical grounding in machine learning, four algorithms with corresponding complexity analyses, and extensive empirical validation across production systems.
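The link between output-token reduction and overall cost can be illustrated with a short calculation. The sketch below assumes a simple linear pricing model (cost equals input tokens times input price plus output tokens times output price) and uses hypothetical per-call token counts; only the 4-8x price ratio and the roughly 33% output reduction come from the abstract.

```python
def cost_saving_fraction(input_tokens, output_tokens, price_ratio, output_reduction):
    """Fraction of total per-call cost saved when output tokens shrink.

    price_ratio: output-token price divided by input-token price (4 to 8 per the abstract).
    output_reduction: fraction of output tokens removed (about 0.33 per the abstract).
    Costs are expressed in units of the input-token price, so that price cancels out.
    """
    baseline = input_tokens + price_ratio * output_tokens
    reduced = input_tokens + price_ratio * output_tokens * (1.0 - output_reduction)
    return (baseline - reduced) / baseline


if __name__ == "__main__":
    # Hypothetical call shape: 200 input tokens, 400 output tokens.
    for ratio in (4, 8):
        saving = cost_saving_fraction(200, 400, ratio, 0.33)
        print(f"price ratio {ratio}x: total cost reduced by {saving:.1%}")
```

Under this model the total saving is always somewhat below the output-token reduction itself, and it approaches that reduction as the output/input price ratio or the share of output tokens grows, which is why output-side optimization matters most when outputs are priced well above inputs.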
