
QuadAgent: A Responsive Agent System for Vision-Language Guided Quadrotor Agile Flight

Ao Zhuang, Feng Yu, Tianbao Zhang, Linzuo Zhang, Danping Zou

Abstract

We present QuadAgent, a training-free agent system for agile quadrotor flight guided by vision-language inputs. Unlike prior end-to-end or serial agent approaches, QuadAgent decouples high-level reasoning from low-level control using an asynchronous multi-agent architecture: Foreground Workflow Agents handle active tasks and user commands, while Background Agents perform look-ahead reasoning. The system maintains scene memory via the Impression Graph, a lightweight topological map built from sparse keyframes, and ensures safe flight with a vision-based obstacle avoidance network. Simulation results show that QuadAgent outperforms baseline methods in efficiency and responsiveness. Real-world experiments demonstrate that it can interpret complex instructions, reason about its surroundings, and navigate cluttered indoor spaces at speeds up to 5 m/s.

Paper Structure

This paper contains 14 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure A1: Complex reasoning using our agent system. Given the identical conditional task, the left and right panels illustrate the agent's behavior depending on observations: the agent navigates to the badminton net through random obstacles when "yellow sign number 7" is observed (left), and proceeds to the white table in front of the yellow frame when it is not (right).
  • Figure A2: System Overview. In Foreground Workflow Agents, the orchestrator monitors events ($\epsilon_{usr}, \epsilon_{phy}$) in the idle state and routes tasks to the planner or executor. Both the executor and pre-executor autonomously call mnemonic, navigation, or perceptual skills from the skill library. Notably, the navigation skill triggers $\epsilon_{path}$ to drive the onboard physical layer state machine for actuation, transitioning among the following states: $q_{hover}$ (hovering), $q_{align\_s}$ (in-place rotation to align with the initial path heading), $q_{avoid}$ (path tracking via collision avoidance policy), and $q_{align\_e}$ (in-place rotation to the final target orientation). The Impression Graph at the top right underpins mnemonic and navigation skills as the scene prior representation. Images in the Impression Graph and the physical layer show the arena and UAV used in real-world experiments.
  • Figure C1: Timelines of Typical Cases. In each sub-image, the upper rows depict the Foreground Workflow Agents' lifecycle, where $q_{idle}$ marks the idle state, $q_{orc}$ indicates the orchestrator's active routing phase, and $q_{plan}/q_{exec}$ denote the engagement of the planner and executor, respectively. The lower rows track the physical layer states. (a) Our suspend-and-resume protocol yields the agent to the idle state ($q_{idle}$) immediately after triggering $\epsilon_{path}$, enabling "Chatting-while-Flying", unlike the Blocking Baseline (b) where the agent is blocked until physical completion. (c) Informational Queries are resolved in parallel with the ongoing flight state ($q_{avoid}$). (d) Conflicting Commands trigger preemptive replanning, issuing a new $\epsilon_{path}$ to immediately redirect the UAV.
  • Figure C2: Data Flow of Background Pre-execution. The pipeline branches from the task registry (center left). While the Foreground Workflow Agents execute the task from the first one, upcoming sub-tasks are assigned across Background Agents (bottom) for pre-execution using mnemonic skills. The retrieved context is cached in a shared buffer (center right) and dynamically injected into the prompt of the Executor (top right).
  • Figure C3: Impression Graph Construction Pipeline. (a) Topological Connectivity: The depth map is tessellated into patches and projected into geometric pyramidal frustums. Edges $(n_i, n_j)$ are established solely if the volumetric intersection of their frustums exceeds $\sigma_{vol}$. (b) Semantic Generation: $I_{rgb}$ is segmented into depth-stratified views (Near, Far, Full) and concatenated into a composite input for the VLM.
  • ...and 1 more figure
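The edge rule in Figure C3(a) — connecting two keyframe nodes only when their viewing frustums overlap sufficiently — can be illustrated with a minimal sketch. This is not the paper's implementation: exact frustum intersection is a geometric computation, so here each frustum is approximated by an axis-aligned bounding box, and `sigma_vol` is an assumed threshold on the overlap fraction relative to the smaller volume.

```python
# Hypothetical sketch of the Impression Graph edge test from Figure C3(a).
# Assumption: each pyramidal frustum is approximated by an axis-aligned box
# (xmin, ymin, zmin, xmax, ymax, zmax); the real system intersects frustums.
from itertools import combinations

def volume(b):
    """Volume of an axis-aligned box."""
    return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])

def overlap_volume(a, b):
    """Intersection volume of two axis-aligned boxes (0 if disjoint)."""
    dims = [max(0.0, min(a[i + 3], b[i + 3]) - max(a[i], b[i])) for i in range(3)]
    return dims[0] * dims[1] * dims[2]

def build_edges(nodes, sigma_vol=0.5):
    """nodes: dict node_id -> box. Connect (n_i, n_j) when the overlap,
    normalized by the smaller box's volume, exceeds sigma_vol."""
    edges = []
    for (i, bi), (j, bj) in combinations(nodes.items(), 2):
        if overlap_volume(bi, bj) / min(volume(bi), volume(bj)) > sigma_vol:
            edges.append((i, j))
    return edges
```

For example, two keyframes looking at nearly the same region become neighbors in the graph, while a keyframe of a distant part of the scene stays disconnected; the resulting sparse topological edges are what the mnemonic and navigation skills traverse.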