A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding

Mengjingcheng Mo, Xinyang Tong, Mingpi Tan, Jiaxu Leng, Jiankang Zheng, Yiran Liu, Haosheng Chen, Ji Gan, Weisheng Li, Xinbo Gao

TL;DR

A2Seek tackles the challenge of aerial anomaly understanding by providing a large-scale, multimodal UAV benchmark with fine-grained region localization and natural language reasoning annotations. The authors propose A2Seek-R1, a two-stage framework that first activates latent reasoning via Graph-of-Thought guided supervised fine-tuning and then optimizes reasoning and localization through aerial-specific reinforcement fine-tuning with the A-GRPO policy and a seeking mechanism that mimics UAV behavior. Empirical results show substantial gains in both anomaly detection accuracy (AP_c) and localization (mIoU), along with strong language-grounded reasoning metrics and robust out-of-domain generalization. The work delivers a dataset and a reasoning-centric paradigm that advances interpretable, region-aware aerial anomaly understanding with practical implications for public safety and surveillance, while outlining ethical considerations and directions for future enhancement.

Abstract

While unmanned aerial vehicles (UAVs) offer wide-area, high-altitude coverage for anomaly detection, they face challenges such as dynamic viewpoints, scale variations, and complex scenes. Existing datasets and methods, mainly designed for fixed ground-level views, struggle to adapt to these conditions, leading to significant performance drops in drone-view scenarios. To bridge this gap, we introduce A2Seek (Aerial Anomaly Seek), a large-scale, reasoning-centric benchmark dataset for aerial anomaly understanding. This dataset covers various scenarios and environmental conditions, providing high-resolution real-world aerial videos with detailed annotations, including anomaly categories, frame-level timestamps, region-level bounding boxes, and natural language explanations for causal reasoning. Building on this dataset, we propose A2Seek-R1, a novel reasoning framework that generalizes R1-style strategies to aerial anomaly understanding, enabling a deeper understanding of "Where" anomalies occur and "Why" they happen in aerial frames. To this end, A2Seek-R1 first employs a graph-of-thought (GoT)-guided supervised fine-tuning approach to activate the model's latent reasoning capabilities on A2Seek. Then, we introduce Aerial Group Relative Policy Optimization (A-GRPO) to design rule-based reward functions tailored to aerial scenarios. Furthermore, we propose a novel "seeking" mechanism that simulates UAV flight behavior by directing the model's attention to informative regions. Extensive experiments demonstrate that A2Seek-R1 achieves up to a 22.04% improvement in AP for prediction accuracy and a 13.9% gain in mIoU for anomaly localization, exhibiting strong generalization across complex environments and out-of-distribution scenarios. Our dataset and code are released at https://2-mo.github.io/A2Seek/.
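The two quantities the abstract reports or relies on, mIoU for localization and group-relative rewards for policy optimization, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` format and uses the standard GRPO normalization, in which each sampled response's reward is standardized against the mean and standard deviation of its group. The function names `box_iou`, `mean_iou`, and `group_relative_advantages` are hypothetical, and the aerial-specific reward rules of A-GRPO are not described in this section, so they are not modeled here.

```python
from statistics import mean, pstdev


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) (assumed format)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Intersection is zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average IoU over paired predicted and ground-truth boxes."""
    return mean(box_iou(p, g) for p, g in zip(preds, gts))


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: standardize rewards within one group.

    `rewards` holds one scalar reward per sampled response to the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In a setup like this, a localization reward for each sampled response could be its `box_iou` against the annotated region, and `group_relative_advantages` would then rank the responses in a group against each other without a learned value model.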

Paper Structure

This paper contains 30 sections, 14 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Overview of the A2Seek Benchmark. (a) Challenges in aerial anomaly detection. Traditional methods rely on static surveillance views and focus mainly on classification, making it difficult to answer "Where" and "Why" anomalies occur under dynamic UAV perspectives. (b) Dataset statistics across multiple dimensions. (c) Reasoning pipeline. The method consists of two stages: SFT (supervised fine-tuning) for reasoning activation, and RL (reinforcement learning) for dynamic reasoning. (d) High-frequency words in the dataset. (e) Reasoning process. The framework integrates multiple reasoning stages (Trigger, Diagnose, Reasoning, Reflection, and Seeking), emphasizing reasoning-driven anomaly understanding. (f) Performance comparison.
  • Figure 2: Comparison of scene diversity and complexity. Left: fixed-view surveillance datasets. Right: diverse aerial views in A2Seek.
  • Figure 3: Performance comparison of different settings on the A2Seek benchmark.
  • Figure 4: Qualitative results of A2Seek-R1. Beyond predicting anomaly categories, our method provides reasoning traces and accurately localizes the key regions that support its judgment.
  • Figure 5: Representative anomaly types in the A2Seek dataset. Our dataset covers a broad spectrum of anomalous behaviors across different risk levels, highlighting the diversity and complexity of aerial anomaly detection.
  • ...and 8 more figures