A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding
Mengjingcheng Mo, Xinyang Tong, Mingpi Tan, Jiaxu Leng, Jiankang Zheng, Yiran Liu, Haosheng Chen, Ji Gan, Weisheng Li, Xinbo Gao
TL;DR
A2Seek tackles aerial anomaly understanding with a large-scale, multimodal UAV benchmark that provides fine-grained region localization and natural language reasoning annotations. The authors propose A2Seek-R1, a two-stage framework that first activates latent reasoning via Graph-of-Thought guided supervised fine-tuning and then optimizes reasoning and localization through aerial-specific reinforcement fine-tuning with the A-GRPO policy and a seeking mechanism that mimics UAV flight behavior. Empirical results show gains of up to 22.04% in anomaly detection accuracy (AP_c) and 13.9% in localization (mIoU), along with strong language-grounded reasoning metrics and robust out-of-domain generalization. The work delivers a dataset and a reasoning-centric paradigm for interpretable, region-aware aerial anomaly understanding, with practical implications for public safety and surveillance, and it outlines ethical considerations and directions for future work.
Abstract
While unmanned aerial vehicles (UAVs) offer wide-area, high-altitude coverage for anomaly detection, they face challenges such as dynamic viewpoints, scale variations, and complex scenes. Existing datasets and methods, mainly designed for fixed ground-level views, struggle to adapt to these conditions, leading to significant performance drops in drone-view scenarios. To bridge this gap, we introduce A2Seek (Aerial Anomaly Seek), a large-scale, reasoning-centric benchmark dataset for aerial anomaly understanding. This dataset covers various scenarios and environmental conditions, providing high-resolution real-world aerial videos with detailed annotations, including anomaly categories, frame-level timestamps, region-level bounding boxes, and natural language explanations for causal reasoning. Building on this dataset, we propose A2Seek-R1, a novel reasoning framework that generalizes R1-style strategies to aerial anomaly understanding, enabling a deeper understanding of "Where" anomalies occur and "Why" they happen in aerial frames. To this end, A2Seek-R1 first employs a graph-of-thought (GoT)-guided supervised fine-tuning approach to activate the model's latent reasoning capabilities on A2Seek. Then, we introduce Aerial Group Relative Policy Optimization (A-GRPO) to design rule-based reward functions tailored to aerial scenarios. Furthermore, we propose a novel "seeking" mechanism that simulates UAV flight behavior by directing the model's attention to informative regions. Extensive experiments demonstrate that A2Seek-R1 achieves up to a 22.04% improvement in AP for prediction accuracy and a 13.9% gain in mIoU for anomaly localization, exhibiting strong generalization across complex environments and out-of-distribution scenarios. Our dataset and code are released at https://2-mo.github.io/A2Seek/.
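To make the ideas of a rule-based reward and group-relative optimization concrete, here is a minimal Python sketch. It is not the authors' A-GRPO implementation, since the abstract does not state the actual reward terms or weights. The reward below (a label-match term plus box IoU), the weights `w_cls` and `w_loc`, and all function names are assumptions for illustration. The advantage step follows the standard GRPO idea of normalizing each sampled response's reward by the mean and standard deviation of its group; the use of the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred_boxes, gt_boxes):
    """mIoU: average IoU over paired predicted and ground-truth boxes."""
    return mean(box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes))


def rule_based_reward(pred_label, pred_box, gt_label, gt_box,
                      w_cls=1.0, w_loc=1.0):
    """Hypothetical reward: label correctness plus localization IoU.

    The terms and weights are illustrative assumptions, not the paper's.
    """
    cls_term = 1.0 if pred_label == gt_label else 0.0
    loc_term = box_iou(pred_box, gt_box)
    return w_cls * cls_term + w_loc * loc_term


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In use, several responses would be sampled for the same aerial frame, each scored with `rule_based_reward`, and the resulting list passed to `group_relative_advantages`, so that responses scoring above the group average receive positive advantages and those below receive negative ones.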
