TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications

Neiwen Ling, Guojun Chen, Lin Zhong

TL;DR

TimelyLLM tackles the latency-urgency gap in multi-agent robotic control by introducing segmentation of LLM generation and a time-aware scheduling policy. It leverages the redundancy between rapid plan generation and slower robot execution to suspend and resume generation, prioritizing segments that maximize time utility via a PUD-based scheme. Implemented on vLLM with a custom stop checker and KV-cache resumption, TimelyLLM demonstrates up to $1.97\times$ time utility improvement and an $84\%$ reduction in waiting time across drone, robot-arm, and chatbot tasks, using the LRTrace dataset for realistic workloads. This approach enables scalable, real-time LLM serving for time-sensitive robotics, with broad implications for multi-agent AI systems that must operate under strict deadlines.

Abstract

Large Language Models (LLMs) such as GPT-4 and Llama3 can already comprehend complex commands and process diverse tasks. This advancement facilitates their application in controlling drones and robots for various tasks. However, existing LLM serving systems typically employ a first-come, first-served (FCFS) batching mechanism, which fails to address the time-sensitive requirements of robotic applications. To address this gap, this paper proposes TimelyLLM, a new system that serves multiple robotic agents with time-sensitive requests. TimelyLLM introduces novel mechanisms of segmented generation and scheduling that optimally leverage redundancy between robot plan generation and execution phases. We report an implementation of TimelyLLM on a widely used LLM serving framework and evaluate it on a range of robotic applications. Our evaluation shows that TimelyLLM improves time utility by up to 1.97x and reduces the overall waiting time by 84%.

Paper Structure

This paper contains 28 sections, 17 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: System overview of TimelyLLM: A robotic agent submits a request to TimelyLLM with a time-sensitive requirement defined by a time utility function (TUF). TimelyLLM generates the response for incoming requests in segments by continuously suspending and resuming the generation. The scheduler in TimelyLLM manages all initial and suspended generations and prioritizes them based on the potential time utility gain and estimated execution time. Additionally, TimelyLLM adaptively adjusts batch sizes to efficiently harness GPU parallel computing resources.
  • Figure 2: A Case Comparison of Normal Generation vs. Context-aware Segmented Generation: Under normal generation mode, Urgent Request 1 cannot be executed in time because it is blocked by the generation of Normal Request 0. Content-aware segmented generation optimizes this by releasing resources upon completing Segment 0 of Request 0, thereby allowing the system to process Urgent Request 1. Once Request 1 is processed, the system resumes and completes the subsequent segment of Request 0, which finishes before the execution of Segment 0 completes. No network latency is shown before the execution of Segment 1 and Segment 2, since their delivery runs in parallel with the execution of the previous segment.
  • Figure 3: Three user-perceptible latencies for a robotic request: (i) Request response time $W(s_0)$: the time taken to perform the first action, which is also the waiting time of segment 0. (ii) Robot waiting time $\sum_{k=0}^{K}W(s_{k})$ (shown as $W(s_0)+W(s_1)$ in the figure): the cumulative waiting time introduced by LLM planning, represented as the sum of the waiting times of all segments. For normal LLM generation, this equals the request response time. (iii) Task completion time $C(r)$: the total duration encompassing the robot waiting time and the execution time.
  • Figure 4: Robots used in data collection: We utilize a Ryze Tech Tello drone [dji2023tellosdk] and a Neuromeka Indy7 Pro robotic arm [robotarmindy]. These platforms are used to profile the actual execution time of various robotic skills.
  • Figure 5: End-to-end performance under different levels of resource contention (LW: Low Workloads, HW: High Workloads). Panels (a) and (d) show that TimelyLLM significantly improves the utility of urgent requests over the vLLM baseline. The remaining panels show that TimelyLLM reduces the response time and waiting time across most tasks.
  • ...and 5 more figures
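
The three latencies defined in the Figure 3 caption can be illustrated with a small calculation. The sketch below is not the authors' code: the `Segment` type, the `latencies` helper, and all timing values are hypothetical, and it assumes that $W(s_k)$ is the time the robot sits idle before it can start segment $k$ (either because the plan text has not arrived yet or, for later segments, zero if the text arrived while the previous segment was still executing).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """Hypothetical record for one plan segment (not from the paper)."""
    ready_at: float   # time (s) at which the segment's text reaches the robot
    exec_time: float  # time (s) the robot needs to execute the segment


def latencies(arrival: float, segments: List[Segment]) -> Tuple[float, float, float]:
    """Return (response time W(s_0), robot waiting time sum W(s_k), completion time C(r)).

    Assumption: W(s_k) is the robot's idle time immediately before segment k.
    """
    waits = []
    robot_free_at = arrival  # the robot is idle from the moment it submits the request
    for seg in segments:
        start = max(robot_free_at, seg.ready_at)
        waits.append(start - robot_free_at)      # idle time before this segment
        robot_free_at = start + seg.exec_time    # robot is busy until execution ends
    response_time = waits[0]
    waiting_time = sum(waits)
    completion_time = robot_free_at - arrival    # waiting time + total execution time
    return response_time, waiting_time, completion_time


# Normal generation: the whole plan arrives at once (made-up numbers).
normal = latencies(0.0, [Segment(ready_at=4.0, exec_time=6.0)])

# Segmented generation: segment 1 is generated while segment 0 executes.
segmented = latencies(0.0, [Segment(ready_at=1.5, exec_time=3.0),
                            Segment(ready_at=4.0, exec_time=3.0)])

print("normal   :", normal)
print("segmented:", segmented)
```

With these illustrative numbers, the single-segment case has equal response time and robot waiting time, matching the caption's remark about normal generation, while the segmented case hides the generation of the second segment behind the execution of the first.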