
FastTTS: Accelerating Test-Time Scaling for Edge LLM Reasoning

Hao Mark Chen, Zhiwen Mo, Guanxi Lu, Shuang Liang, Lingxiao Ma, Wayne Luk, Hongxiang Fan

TL;DR

FastTTS addresses the practical challenge of running reasoning-enabled edge LLMs under tight memory constraints by introducing three complementary system optimizations: Speculative Beam Extension to hide irregular workload latency, Dynamic Prefix-Aware Scheduling to maximize KV-cache reuse, and Asymmetric Multi-Model Memory Allocation to balance memory between generator and verifier. Implemented as a plug-in on top of vLLM, FastTTS achieves around 2.2× higher goodput and 38–68% lower latency on memory-constrained GPUs while maintaining algorithmic equivalence with baseline TTS methods. The approach scales across model pairings and hardware platforms, including lower-end edge GPUs, and extends to code-generation benchmarks. Overall, FastTTS demonstrates that careful system-level optimizations can narrow the gap between edge and cloud reasoning while preserving accuracy and responsiveness.

Abstract

Recent advances in reasoning Large Language Models (LLMs) are driving the emergence of agentic AI systems. Edge deployment of LLM agents near end users is increasingly necessary to protect data privacy, enable offline use, and provide responsive interaction with local context. However, strict memory constraints on edge devices limit deployment to smaller LLMs, whose reasoning capabilities are much weaker than those of large cloud models, hindering practical deployment of edge agentic AI. Test-Time Scaling (TTS) offers a promising solution by allocating more compute during inference to enhance the reasoning capability of edge LLMs. However, current TTS methods introduce heavy hardware performance overhead on resource-constrained devices, making them impractical for real applications. To address this challenge, we present FastTTS, a serving system that enables fast and efficient TTS for memory-constrained LLM reasoning. After analyzing common patterns across various TTS methods and identifying their performance bottlenecks, we introduce three novel techniques: i) Speculative Beam Extension, which mitigates system stragglers caused by irregular reasoning paths, ii) Asymmetric Multi-Model Memory Allocation, which dynamically balances memory usage between token generation and reasoning-step verification, and iii) Dynamic Prefix-Aware Scheduling, which optimizes reasoning execution to maximize KV-cache reuse across search paths. FastTTS offers a plug-and-play third-party library on top of vLLM, enabling edge LLMs on a single consumer GPU to match cloud-model accuracy and cloud-measured latency. Comprehensive evaluation shows that FastTTS achieves an average 2.2× higher goodput and reduces latency by 38–68% compared to the vLLM baseline; it pushes the boundaries of low-latency TTS on memory-constrained edge devices and highlights the potential for democratizing agentic AI.
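
To make the idea behind prefix-aware scheduling concrete, the following is a minimal sketch and not the FastTTS scheduler itself. It assumes beams are represented as lists of token ids and uses plain lexicographic sorting as a stand-in heuristic for placing beams that share a prefix next to each other; the function names `shared_prefix_len`, `prefix_aware_order`, and `adjacent_reuse` are hypothetical and do not come from the paper or from vLLM.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Return the number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_aware_order(beams: List[List[int]]) -> List[List[int]]:
    """Order beams so that beams sharing a prefix end up adjacent.

    Assumption: lexicographic sorting is used here only as an illustrative
    heuristic; the actual scheduling policy in FastTTS may differ.
    """
    return sorted(beams)


def adjacent_reuse(beams: List[List[int]]) -> int:
    """Sum of prefix tokens shared between consecutive beams in this order.

    A larger value means more cached prefix tokens could be reused when
    beams are processed one after another.
    """
    return sum(shared_prefix_len(a, b) for a, b in zip(beams, beams[1:]))


if __name__ == "__main__":
    beams = [[1, 2, 3], [4, 5], [1, 2, 4], [4, 5, 6]]
    print("arrival order reuse:", adjacent_reuse(beams))
    print("prefix-aware reuse: ", adjacent_reuse(prefix_aware_order(beams)))
```

In this toy input, the arrival order interleaves two unrelated prefix groups, so consecutive beams share nothing; grouping them by prefix lets each neighbor reuse the tokens it has in common with the previous beam.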

Paper Structure

This paper contains 59 sections, 17 equations, 18 figures, 1 algorithm.

Figures (18)

  • Figure 1: (a) Memory cost across models. (b) FastTTS enables low-latency edge deployment of reasoning models. Cloud accuracy: GPT-o1-preview. Edge accuracy: Qwen2.5-Math-1.5B. Cloud latency is taken from the first-answer latency of GPT-o3-pro and GPT-5 (thinking models) [artificialanalysis_leaderboard_reasoning_large; yang2025reasonflux].
  • Figure 2: Illustration of different TTS methods.
  • Figure 3: Left: Accuracy vs. latency for different TTS methods on the MATH-500 dataset. Right: Average and maximum token count per generation step of Qwen2.5-Math-1.5B on AIME.
  • Figure 4: GPU compute utilization in the generation and verification phases over time. Utilization is irregular during the generation phase. The metrics are collected using Nsight Systems, NVIDIA’s official profiling tool, at a sampling rate of 10,000 samples per second for the Tensor Core utilization metrics.
  • Figure 5: Optimization opportunity in Dynamic Prefix-Cache Sharing. Left: Prefix-cache sharing enables potential substantial memory savings for different TTS methods. Right: Naive scheduling overlooks the dynamic nature of prefix-cache sharing.
  • ...and 13 more figures
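
The memory-saving potential of prefix-cache sharing mentioned in the Figure 5 caption can be illustrated with a small counting example. This is a rough sketch under simplifying assumptions, not a measurement from the paper: each search path is a list of token ids, every cached token costs one unit of KV-cache memory, and tokens are shared only when the entire preceding prefix is identical. The function names are hypothetical.

```python
from typing import List, Set, Tuple


def kv_tokens_without_sharing(paths: List[List[int]]) -> int:
    """KV entries needed if every path stores its own full copy."""
    return sum(len(p) for p in paths)


def kv_tokens_with_sharing(paths: List[List[int]]) -> int:
    """KV entries needed if identical prefixes are stored only once.

    Each distinct non-empty prefix corresponds to exactly one cached token
    position, so counting distinct prefixes counts the stored entries.
    """
    prefixes: Set[Tuple[int, ...]] = set()
    for p in paths:
        for i in range(1, len(p) + 1):
            prefixes.add(tuple(p[:i]))
    return len(prefixes)


def sharing_savings(paths: List[List[int]]) -> float:
    """Fraction of KV entries saved by prefix sharing (0.0 if no tokens)."""
    total = kv_tokens_without_sharing(paths)
    if total == 0:
        return 0.0
    return 1.0 - kv_tokens_with_sharing(paths) / total


if __name__ == "__main__":
    # Three toy search paths that branch from a common beginning.
    paths = [[1, 2, 3], [1, 2, 4], [1, 5]]
    print(kv_tokens_without_sharing(paths))
    print(kv_tokens_with_sharing(paths))
    print(sharing_savings(paths))
```

The savings grow with the number of paths that branch late from a common prefix, which is why the order in which paths are scheduled and evicted matters for how much of that sharing is realized in practice.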