
Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs

Hao Kang, Qingru Zhang, Han Cai, Weiyuan Xu, Tushar Krishna, Yilun Du, Tsachy Weissman

TL;DR

This work presents FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands. It shows that the optimal latency-quality balance varies by task, and that sacrificing some quality for lower latency can significantly improve downstream performance.

Abstract

Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks, and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses directly translate into higher rewards. Despite the importance of this latency-quality trade-off, it remains underexplored in the context of LLM-based agents. In this work, we present the first systematic study of this trade-off in real-time decision-making tasks. To support our investigation, we introduce two new benchmarks: HFTBench, a high-frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis reveals that the optimal latency-quality balance varies by task, and that sacrificing quality for lower latency can significantly enhance downstream performance. To address this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands. Our method achieves the best performance on both benchmarks, improving win rate by up to 80% in Street Fighter and boosting daily yield by up to 26.52% in trading. These results underscore the need for latency-aware evaluation and deployment strategies for real-world LLM-based agents. Our benchmarks are available at Latency Sensitive Benchmarks.
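The idea of choosing a model size and quantization level to fit a latency demand can be illustrated with a small sketch. This is not FPX's actual selection rule, which the abstract does not specify. It is a minimal stand-in that picks the highest-quality configuration fitting a latency budget. The `Config` type, the `select_config` function, the configuration names, and every latency and quality number below are hypothetical placeholders chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One (model size, quantization) option. All values are hypothetical."""
    name: str
    latency_ms: float  # assumed per-decision latency
    quality: float     # assumed task-quality score in [0, 1]


def select_config(configs, latency_budget_ms):
    """Return the highest-quality config whose latency fits the budget.

    If nothing fits, fall back to the fastest config so the agent
    still acts as early as possible.
    """
    feasible = [c for c in configs if c.latency_ms <= latency_budget_ms]
    if not feasible:
        return min(configs, key=lambda c: c.latency_ms)
    return max(feasible, key=lambda c: c.quality)


# Hypothetical candidates; not measurements from the paper.
CANDIDATES = [
    Config("14B-fp16", 900.0, 0.90),
    Config("7B-fp16", 500.0, 0.82),
    Config("7B-int8", 350.0, 0.78),
    Config("3B-int4", 150.0, 0.60),
]

if __name__ == "__main__":
    for budget in (1000.0, 400.0, 100.0):
        print(budget, select_config(CANDIDATES, budget).name)
```

A fixed budget is the simplest possible policy. The abstract's finding that the best latency-quality balance differs by task suggests that, in practice, the budget itself would have to be tuned per task rather than set once.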

Paper Structure

This paper contains 24 sections, 7 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Latency–accuracy trade-offs across different model configurations and tasks. (a) FPX enables a smooth and continuous trade-off between latency and accuracy, allowing models to meet diverse task-specific requirements. (b) In the Street Fighter benchmark, win rate first increases as latency decreases, peaking at a Pareto-optimal point, before dropping due to excessive accuracy loss. (c) In HFTBench, daily yield improves with moderate latency reduction, but degrades when model accuracy is overly compromised.
  • Figure 2: Comparison of agentic LLMs in static environments, such as code generation or research, and in time-sensitive environments, such as trading and gaming. In time-sensitive environments, the state changes constantly over time and through other agents' interactions, so the reward depends on both the quality and the latency of the agent's decisions.
  • Figure 3: Visualizations of HFTBench testing data.