UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning

Xiangyu Wang, Donglin Yang, Yue Liao, Wenhao Zheng, Wenjun Wu, Bin Dai, Hongsheng Li, Si Liu

TL;DR

The paper addresses the challenge of language-driven, fine-grained UAV control in real-world environments. It proposes Flow as a task and introduces UAV-Flow, a benchmark consisting of a real-world dataset, a ground-drone deployment framework, and a simulation suite. Experiments show that VLA-based approaches outperform VLN baselines in short-range, language-conditioned control, and that open-vocabulary instructions improve generalization. This work enables direct deployment without a sim-to-real gap and provides a foundation for integrating language-driven control with real-time UAV manipulation.

Abstract

Unmanned Aerial Vehicles (UAVs) are evolving into language-interactive platforms, enabling more intuitive forms of human-drone interaction. While prior works have primarily focused on high-level planning and long-horizon navigation, we shift attention to language-guided fine-grained trajectory control, where UAVs execute short-range, reactive flight behaviors in response to language instructions. We formalize this problem as the Flying-on-a-Word (Flow) task and introduce UAV imitation learning as an effective approach. In this framework, UAVs learn fine-grained control policies by mimicking expert pilot trajectories paired with atomic language instructions. To support this paradigm, we present UAV-Flow, the first real-world benchmark for language-conditioned, fine-grained UAV control. It includes a task formulation, a large-scale dataset collected in diverse environments, a deployable control framework, and a simulation suite for systematic evaluation. Our design enables UAVs to closely imitate the precise, expert-level flight trajectories of human pilots and supports direct deployment without a sim-to-real gap. We conduct extensive experiments on UAV-Flow, benchmarking VLN and VLA paradigms. Results show that VLA models are superior to VLN baselines and highlight the critical role of spatial grounding in the fine-grained Flow setting.
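
The imitation-learning idea in the abstract (a policy trained to mimic expert pilot trajectories paired with atomic language instructions) can be illustrated with a toy sketch. This is not the paper's method: the VLA models it benchmarks also condition on visual observations, which this example ignores. The `Demo` type, the 4-value action layout `(dx, dy, dz, dyaw)`, and both function names are hypothetical, chosen only for illustration. The sketch fits the simplest possible behavior-cloning policy, one constant action per instruction, which minimizes mean squared error against the expert actions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical action layout (an assumption, not from the paper):
# (dx, dy, dz, dyaw) displacement per control step.
Action = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Demo:
    """One expert demonstration: an atomic instruction and the pilot's actions."""
    instruction: str
    actions: List[Action]


def fit_mean_action_policy(demos: List[Demo]) -> Dict[str, Action]:
    """Toy behavior cloning: map each instruction to the mean expert action.

    For a policy that outputs one constant action per instruction, the mean
    of the expert actions is the minimizer of the squared imitation error.
    """
    sums: Dict[str, List[float]] = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    counts: Dict[str, int] = defaultdict(int)
    for demo in demos:
        for action in demo.actions:
            for i, value in enumerate(action):
                sums[demo.instruction][i] += value
            counts[demo.instruction] += 1
    return {
        instr: tuple(s / counts[instr] for s in total)
        for instr, total in sums.items()
    }


def imitation_mse(policy: Dict[str, Action], demos: List[Demo]) -> float:
    """Mean squared error between the policy's action and each expert action."""
    total, n = 0.0, 0
    for demo in demos:
        predicted = policy[demo.instruction]
        for action in demo.actions:
            total += sum((p - a) ** 2 for p, a in zip(predicted, action))
            n += 1
    return total / n if n else 0.0


# Made-up demonstrations, for illustration only.
demos = [
    Demo("ascend", [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 3.0, 0.0)]),
    Demo("turn left", [(0.0, 0.0, 0.0, 0.5)]),
]
policy = fit_mean_action_policy(demos)
error = imitation_mse(policy, demos)
```

A learned policy would replace the lookup table with a model that takes the camera image and instruction as input, but the training signal is the same in spirit: reduce the gap between predicted and expert actions.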

Paper Structure

This paper contains 21 sections, 1 equation, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Overview of our UAV-Flow benchmark. It consists of a large-scale real-world dataset for language-conditioned UAV imitation learning, featuring multiple UAV platforms, diverse environments, and a wide range of fine-grained flight skill tasks. To enable systematic experimental analysis under the Flow task setting, we additionally provide a simulation-based evaluation protocol and deploy VLA models on real UAVs. To the best of our knowledge, this is the first real-world deployment of VLA models for language-guided UAV control in open environments.
  • Figure 2: Analysis of traditional UAV VLN and our Flow. Left: VLN tasks aim to reach distant goals by planning long-horizon paths from instructions. Right: Flow focuses on executing short-range, language-guided trajectories toward visually grounded targets within the current scene.
  • Figure 3: Visualization of Flow tasks. Given the same instruction, human pilots execute diverse real-world trajectories. We show 2D flight paths over aerial scenes and reconstructed 3D trajectories.
  • Figure 4: Real-world UAV data collection pipeline.
  • Figure 5: Dataset statistics for UAV-Flow and UAV-Flow-Sim. We show the distribution of task types (by percentage) and trajectory distances across both datasets.
  • ...and 9 more figures