FastTTS: Accelerating Test-Time Scaling for Edge LLM Reasoning
Hao Mark Chen, Zhiwen Mo, Guanxi Lu, Shuang Liang, Lingxiao Ma, Wayne Luk, Hongxiang Fan
TL;DR
FastTTS addresses the practical challenge of running reasoning-enabled edge LLMs under tight memory constraints by introducing three complementary system optimizations: Speculative Beam Extension to hide the latency of irregular workloads, Dynamic Prefix-Aware Scheduling to maximize KV-cache reuse, and Asymmetric Multi-Model Memory Allocation to balance memory between generator and verifier. Implemented as a plug-in on top of vLLM, FastTTS achieves an average of 2.2× higher goodput and 38–68% lower latency than the vLLM baseline on memory-constrained GPUs while maintaining algorithmic equivalence with baseline TTS methods. The approach scales across model pairings and hardware, including lower-end edge GPUs, and extends to code-generation benchmarks, underscoring its practical value for democratizing agentic AI at the edge. Overall, FastTTS demonstrates that careful system-level optimizations can narrow the gap between edge and cloud reasoning while preserving accuracy and responsiveness in real-world tasks.
Abstract
Recent advances in reasoning Large Language Models (LLMs) are driving the emergence of agentic AI systems. Edge deployment of LLM agents near end users is increasingly necessary to protect data privacy, enable offline use, and provide responsive interaction with local context. However, strict memory constraints on edge devices limit deployment to smaller LLMs, whose reasoning capabilities are much weaker than those of large cloud models, hindering practical deployment of edge agentic AI. Test-Time Scaling (TTS) offers a promising solution by allocating more compute during inference to enhance the reasoning capability of edge LLMs. However, current TTS methods incur heavy performance overhead on resource-constrained hardware, making them impractical for real applications. To address this challenge, we present FastTTS, a serving system that enables fast and efficient TTS for memory-constrained LLM reasoning. After analyzing common patterns across various TTS methods and identifying their performance bottlenecks, we introduce three novel techniques: (i) Speculative Beam Extension, which mitigates system stragglers caused by irregular reasoning paths; (ii) Asymmetric Multi-Model Memory Allocation, which dynamically balances memory usage between token generation and reasoning-step verification; and (iii) Dynamic Prefix-Aware Scheduling, which optimizes reasoning execution to maximize KV-cache reuse across search paths. FastTTS is provided as a plug-and-play third-party library on top of vLLM, enabling edge LLMs on a single consumer GPU to match both the accuracy of cloud models and the latency measured for cloud deployments. Comprehensive evaluation shows that FastTTS achieves an average of 2.2× higher goodput and reduces latency by 38–68% compared to the vLLM baseline. It pushes the boundaries of low-latency TTS on memory-constrained edge devices and highlights the potential for democratizing agentic AI.
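
To make the idea behind prefix-aware scheduling concrete, the following is a minimal illustrative sketch, not the FastTTS implementation. It assumes a deliberately simplified cache model in which only the previously executed path's KV entries are resident, so a path can reuse exactly the tokens it shares as a prefix with its predecessor. The function names (shared_prefix_len, reused_tokens, prefix_aware_order) and the toy token IDs are hypothetical.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two search paths have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reused_tokens(order: List[Sequence[int]]) -> int:
    """Tokens whose KV entries could be reused under the simplified
    assumption that only the previous path's cache is still resident."""
    return sum(shared_prefix_len(p, q) for p, q in zip(order, order[1:]))


def prefix_aware_order(paths: List[Sequence[int]]) -> List[Sequence[int]]:
    """Order paths lexicographically so that paths sharing a prefix
    become adjacent in the execution schedule."""
    return sorted(paths, key=tuple)


# Toy search paths as token-ID lists (hypothetical values).
paths = [[1, 2, 3], [7, 8], [1, 2, 4], [7, 9]]

print(reused_tokens(paths))                      # reuse in arrival order
print(reused_tokens(prefix_aware_order(paths)))  # reuse after reordering
```

In this toy setting, grouping paths with common prefixes increases the number of reusable tokens relative to the arrival order. A real scheduler would also have to account for cache capacity, eviction, and the changing set of live beams, none of which this sketch models.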
