TypeFly: Flying Drones with Large Language Model

Guojun Chen; Xiaojing Yu; Neiwen Ling; Lin Zhong

TL;DR

TypeFly tackles the latency bottleneck of LLM-driven drone control by introducing MiniSpec, a token-efficient domain-specific language, and a stream-based runtime that interprets and executes plans as they are generated. The system couples an on-premise edge server, a vision module, and a cloud LLM with a prompt generator, enabling low-latency control through streaming execution, a probe mechanism for runtime reasoning, and an exception-handling facility (replan) for dynamic environments. Across 11 benchmark tasks, TypeFly achieves up to a 62% reduction in response time and substantial token savings, with robust performance aided by MiniSpec, probe, and replan; however, limitations remain in geometric reasoning and in memory of past scenes. The work highlights practical advances toward real-time, privacy-preserving, LLM-assisted drone control and suggests directions such as memory-enabled scene modeling and prompt caching to further reduce latency and improve reliability.

Abstract

Recent advancements in robot control using large language models (LLMs) have demonstrated significant potential, primarily due to LLMs' ability to understand natural language commands and generate executable plans in various languages. However, in real-time and interactive applications involving mobile robots, particularly drones, the sequential token generation process inherent to LLMs introduces substantial latency, i.e., response time, in control plan generation. In this paper, we present a system called ChatFly that tackles this problem using a combination of a novel programming language called MiniSpec and its runtime to reduce the plan generation time and drone response time. That is, instead of asking an LLM to write a program (robotic plan) in the popular but verbose Python, ChatFly has it write the program in MiniSpec, which is specially designed for token efficiency and stream interpretation. Using a set of challenging drone tasks, we show that the design choices made by ChatFly can reduce response time by up to 62% and provide a more consistent user experience, enabling responsive and intelligent LLM-based drone control with efficient completion.
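The stream interpretation idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' runtime. It assumes, hypothetically, that top-level statements are separated by `;` and that a separator only counts outside braces and outside double-quoted strings; the function names `stream_statements` and `demo_event_order` are invented for this illustration. The plan text reuses the two statements quoted in the Figure 1 caption.

```python
from typing import Iterable, Iterator, List, Tuple


def stream_statements(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each top-level statement as soon as it is complete.

    Assumption (not from the paper): ';' separates statements, and it is
    only a separator outside braces and outside double-quoted strings.
    """
    buf: List[str] = []
    depth = 0          # current brace nesting level
    in_string = False  # True while inside a double-quoted string
    for chunk in chunks:
        for ch in chunk:
            if ch == '"':
                in_string = not in_string
            elif not in_string:
                if ch == "{":
                    depth += 1
                elif ch == "}":
                    depth -= 1
                elif ch == ";" and depth == 0:
                    stmt = "".join(buf).strip()
                    buf.clear()
                    if stmt:
                        yield stmt  # hand over before reading more input
                    continue
            buf.append(ch)
    tail = "".join(buf).strip()
    if tail:
        yield tail  # last statement has no trailing separator


def demo_event_order() -> List[Tuple[str, str]]:
    """Record the interleaving of received chunks and executed statements."""
    events: List[Tuple[str, str]] = []

    def fake_llm() -> Iterator[str]:
        # Hypothetical chunking of the plan from the Figure 1 caption.
        for piece in ['tc(1', '80);?p("Any food',
                      ' target here?")!=False{g("Ta', 'ble")}']:
            events.append(("recv", piece))
            yield piece

    for stmt in stream_statements(fake_llm()):
        events.append(("exec", stmt))  # stand-in for driving the drone
    return events
```

Calling `demo_event_order()` shows the first statement being handed to the executor after only two chunks have arrived, while the rest of the plan is still being received. A whole-plan interpreter would instead execute nothing until the final chunk.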

Paper Structure (26 sections, 1 equation, 7 figures, 3 tables)

This paper contains 26 sections, 1 equation, 7 figures, 3 tables.

Introduction
Background
Latency
Response Mode
Related Work
System Overview
MiniSpec
Language Design
Concise and Limited Function Calls
Capable and Predictable Logic Control
Simplified Syntax for Stream Interpreting
Exception Handling
System Skills
Low-level skills
High-level skills
...and 11 more sections

Figures (7)

Figure 1: System overview of TypeFly. An on-premise edge server controls the drone to accomplish a task that a user, whether a human or a language agent, describes in natural language. Based on the task description and a scene description from a vision encoder, the LLM writes a program in MiniSpec (see Language Design), called a plan. In this example, the plan includes one elementary statement, tc(180), which turns the drone by 180 degrees, and one composite statement, ?p("Any food target here?")!=False{g("Table")}, which moves the drone to a table depending on whether there is food on it. With Stream Interpreting, the drone can start to act on a statement while the Remote LLM is still generating the next one. TypeFly can deal with syntax errors and unexpected situations through MiniSpec's exception-handling mechanism, replan (see Exception Handling). Additionally, using a special skill, probe (see System Skills), TypeFly can engage the LLM during plan execution.
Figure 2: GPT-4 API latency as a function of the number of input and output tokens (each point is the average of 10 measurements). The top plot varies the number of input tokens while holding the number of output tokens constant; the bottom plot varies the number of output tokens while holding the number of input tokens fixed. The trend lines suggest that $b\approx 2800a$. Although the correlation between latency and the number of input tokens is weak, we can still estimate that generating an output token is more than 1000 times slower than processing an input token. (The measurements were conducted on March 5, 2024, using the gpt-4 model.)
Figure 3: Normal vs. Stream Interpreting of a MiniSpec plan, and MiniSpec parsing. In Normal Interpreting, TypeFly waits until the whole plan has been received from the Remote LLM before starting translation and execution. The response time therefore grows with the length of the plan, which makes the drone less responsive when the plan is long. In Stream Interpreting, the response time is reduced to the time needed to receive the first executable part of the plan. The bottom half shows the executable part for different types of statements. Note that network latency is omitted in the figure.
Figure 4: Evaluation setup and a screenshot of the TypeFly interface. We use an inexpensive off-the-shelf drone with video streaming and a programmable control API in our evaluation, showing TypeFly's potential portability to other kinds of robots. The experiments are conducted in a typical office area without any external infrastructure other than our edge server and WiFi.
Figure 5: Response time comparison of the system using MiniSpec with Stream Interpreting, MiniSpec with Normal Interpreting, and Python with Normal Interpreting. The Python plan follows the same logic as the MiniSpec plan. Compared with the Python baseline, using MiniSpec reduces response time by at most $32\%$, and additionally employing Stream Interpreting reduces it by up to $62\%$ while providing more consistent performance.
...and 2 more figures
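To make the ratio in the Figure 2 caption concrete, the following Python sketch works through the arithmetic. It assumes a linear latency model, latency ≈ a·N_in + b·N_out + c, which the caption does not state explicitly; only the ratio b ≈ 2800a comes from the caption. The token counts in the usage note and the function names `relative_cost` and `output_share` are hypothetical.

```python
RATIO_B_OVER_A = 2800.0  # from the Figure 2 caption: b is roughly 2800a


def relative_cost(n_in: int, n_out: int, ratio: float = RATIO_B_OVER_A) -> float:
    """Token-dependent latency in units of a (the per-input-token cost).

    Assumes latency = a*n_in + b*n_out + c and ignores the constant c.
    """
    return n_in + ratio * n_out


def output_share(n_in: int, n_out: int, ratio: float = RATIO_B_OVER_A) -> float:
    """Fraction of the token-dependent latency spent generating output."""
    total = relative_cost(n_in, n_out, ratio)
    if total == 0:
        return 0.0  # no tokens at all, so no token-dependent latency
    return ratio * n_out / total
```

For a hypothetical request with 2000 input tokens and 100 output tokens, `relative_cost(2000, 100)` is 282000 units, of which more than 99% comes from output generation. Halving the output to 50 tokens gives 142000 units, roughly half the original. Under these assumptions, shortening the generated plan matters far more than shortening the prompt, which is consistent with the motivation for a token-efficient plan language.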
