TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications
Neiwen Ling, Guojun Chen, Lin Zhong
TL;DR
TimelyLLM tackles the latency-urgency gap in multi-agent robotic control by introducing segmentation of LLM generation and a time-aware scheduling policy. It leverages the redundancy between rapid plan generation and slower robot execution to suspend and resume generation, prioritizing segments that maximize time utility via a PUD-based scheme. Implemented on vLLM with a custom stop checker and KV-cache resumption, TimelyLLM demonstrates up to $1.97\times$ time utility improvement and an $84\%$ reduction in waiting time across drone, robot-arm, and chatbot tasks, using the LRTrace dataset for realistic workloads. This approach enables scalable, real-time LLM serving for time-sensitive robotics, with broad implications for multi-agent AI systems that must operate under strict deadlines.
Abstract
Large Language Models (LLMs) such as GPT-4 and Llama3 can already comprehend complex commands and process diverse tasks, which makes them applicable to controlling drones and robots. However, existing LLM serving systems typically employ a first-come, first-served (FCFS) batching mechanism, which fails to address the time-sensitive requirements of robotic applications. To close this gap, this paper proposes TimelyLLM, a serving system for multiple robotic agents that issue time-sensitive requests. TimelyLLM introduces novel mechanisms of segmented generation and scheduling that optimally leverage the redundancy between the robot plan generation and execution phases. We report an implementation of TimelyLLM on a widely used LLM serving framework and evaluate it on a range of robotic applications. Our evaluation shows that TimelyLLM improves time utility by up to $1.97\times$ and reduces overall waiting time by $84\%$.
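The idea of segmented generation with time-aware scheduling can be illustrated with a small simulation. The sketch below is not the TimelyLLM implementation. It assumes that each request is generated one segment at a time, that the robot executes a segment while later ones are still pending, and that the scheduler picks the request whose next segment yields the highest utility per unit of generation time. Reading PUD as such a utility-density score, the decaying utility function, the earliest-deadline tie-break, and all names (`Request`, `gen_time`, `exec_time`, `schedule`) are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    gen_time: float      # assumed time to generate one plan segment
    exec_time: float     # assumed time the robot needs to execute one segment
    segments_left: int   # segments still to be generated
    deadline: float      # time at which the robot needs its next segment
    value: float = 1.0   # utility of delivering a segment on time


def utility(req: Request, finish: float) -> float:
    """Assumed time-utility function: full value on time, decaying when late."""
    if finish <= req.deadline:
        return req.value
    return req.value / (1.0 + (finish - req.deadline))


def density(req: Request, now: float) -> float:
    """Utility of the next segment per unit of generation time."""
    return utility(req, now + req.gen_time) / req.gen_time


def schedule(requests: list[Request]) -> list[str]:
    """Serve one segment at a time; return the order in which requests run."""
    now = 0.0
    order: list[str] = []
    pending = [r for r in requests if r.segments_left > 0]
    while pending:
        # Highest density first; ties go to the earliest deadline (assumption).
        best = max(pending, key=lambda r: (density(r, now), -r.deadline))
        now += best.gen_time
        best.segments_left -= 1
        order.append(best.name)
        # The robot starts this segment once it is both delivered and idle,
        # so the next segment is not needed until that execution ends.
        best.deadline = max(now, best.deadline) + best.exec_time
        pending = [r for r in pending if r.segments_left > 0]
    return order


if __name__ == "__main__":
    reqs = [
        Request("drone", gen_time=1.0, exec_time=5.0, segments_left=2, deadline=1.0),
        Request("arm", gen_time=1.0, exec_time=3.0, segments_left=2, deadline=2.0),
    ]
    print(schedule(reqs))  # ['drone', 'arm', 'arm', 'drone']
```

In this toy run the drone's second segment is postponed while the drone is busy executing its first one, which frees the server to generate both arm segments before their deadlines. An FCFS policy would instead finish the entire drone plan before starting the arm request.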
