AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding

Fei Lin, Yonglin Tian, Tengchao Zhang, Jun Huang, Sangtian Guan, Fei-Yue Wang

TL;DR

AirVista-II addresses dynamic scene semantic understanding for UAVs with an end-to-end agentic architecture that unifies perception, planning, and reasoning. The system handles three temporal forms (images, short videos, and long videos) through dedicated planning and execution workflows, including an adaptive keyframe extraction module for long videos. Key contributions are (1) a two-module agentic framework that combines LLM-based planning with modality-specific execution, (2) an adaptive, data-driven keyframe selection strategy intended to improve semantic coverage, and (3) zero-shot evaluation on ERA, CapERA, and SynDrone, where the system is reported to achieve high-quality semantic understanding and question answering. The work targets more autonomous UAV decision-making in real-world dynamic-scene tasks.

Abstract

Unmanned Aerial Vehicles (UAVs) are increasingly important in dynamic environments such as logistics transportation and disaster response. However, current tasks often rely on human operators to monitor aerial videos and make operational decisions. This mode of human-machine collaboration suffers from significant limitations in efficiency and adaptability. In this paper, we present AirVista-II -- an end-to-end agentic system for embodied UAVs, designed to enable general-purpose semantic understanding and reasoning in dynamic scenes. The system integrates agent-based task identification and scheduling, multimodal perception mechanisms, and differentiated keyframe extraction strategies tailored for various temporal scenarios, enabling the efficient capture of critical scene information. Experimental results demonstrate that the proposed system achieves high-quality semantic understanding across diverse UAV-based dynamic scenarios under a zero-shot setting.

Paper Structure

This paper contains 12 sections, 1 equation, 3 figures, 5 tables, 2 algorithms.

Figures (3)

  • Figure 1: Execution pipeline of the AirVista-II system. For clarity, Long Video Agent, Short Video Agent, and Evaluation Agent are abbreviated as LV Agent, SV Agent, and EVAL Agent, respectively. Agents invoked $\geq 2$ times are highlighted in gray.
  • Figure 2: Word Cloud Visualization.
  • Figure 3: Clustering evaluation results on Town01: (a) Sum of Squared Errors (SSE) curve for Elbow Method, (b) Silhouette Score, (c) Davies--Bouldin Index, and (d) Calinski--Harabasz Index.
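
Figure 3 names four standard cluster-validity measures used to judge a clustering as the number of clusters varies. This excerpt does not say which clustering algorithm or frame features the authors use, so the following is only a minimal sketch of how two of those measures (SSE for the Elbow Method and the Silhouette Score) can be computed. It assumes plain k-means with a deterministic farthest-point initialization on hypothetical 2-D toy points; the function names, the toy data, and the range of k are illustrative assumptions, not the paper's implementation.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from its nearest seed."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for every point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, iters=100):
    """Lloyd's algorithm: alternate assignment and mean update until stable."""
    centroids = farthest_point_init(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

def sse(points, centroids, labels):
    """Sum of squared distances to the assigned centroid (elbow-curve value)."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy "frame features": three tight groups of five 2-D points.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

# Scan candidate cluster counts and record (SSE, silhouette) for each.
results = {}
for k in range(2, 6):
    centroids, labels = kmeans(points, k)
    results[k] = (sse(points, centroids, labels), silhouette(points, labels))

# Pick the k with the highest mean silhouette score.
best_k = max(results, key=lambda k: results[k][1])
for k, (err, sil) in results.items():
    print(f"k={k}  SSE={err:.3f}  silhouette={sil:.3f}")
print("best k by silhouette:", best_k)
```

On this toy data the silhouette score peaks at the true number of groups, while SSE keeps falling as k grows, which is why the elbow of the SSE curve is read rather than its minimum. The Davies-Bouldin and Calinski-Harabasz indices from the caption are omitted here to keep the sketch short.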