SOAR: Self-supervision Optimized UAV Action Recognition with Efficient Object-Aware Pretraining

Ruiqi Xian, Xiyang Wu, Tianrui Guan, Xijun Wang, Boqing Gong, Dinesh Manocha

TL;DR

This work addresses UAV action recognition by shifting object knowledge into self-supervised pretraining. It introduces SOAR, a ViT-based masked autoencoder that uses object-aware masking and an object-aware loss to focus learning on human-related regions, enabling efficient pretraining and improved downstream accuracy. SOAR achieves state-of-the-art results on NEC-Drone and UAV-Human, with substantial reductions in pretraining time and memory and fast inference (18.7 ms per video). The approach reduces reliance on heavy annotation and inference-time detection, offering a practical, efficient pathway to robust UAV video understanding.

Abstract

We introduce SOAR, a novel self-supervised pretraining algorithm for aerial footage captured by Unmanned Aerial Vehicles (UAVs). We incorporate human object knowledge throughout the pretraining process to enhance UAV video pretraining efficiency and downstream action recognition performance. This is in contrast to prior works that primarily incorporate object information during the fine-tuning stage. Specifically, we first propose a novel object-aware masking strategy designed to retain the visibility of certain patches related to objects throughout the pretraining phase. Second, we introduce an object-aware loss function that utilizes object information to adjust the reconstruction loss, preventing bias towards less informative background patches. In practice, SOAR with a vanilla ViT backbone outperforms the best existing UAV action recognition models, recording boosts of 9.7% and 21.4% in top-1 accuracy on the NEC-Drone and UAV-Human datasets, respectively, while delivering an inference speed of 18.7 ms per video, making it 2x to 5x faster. Additionally, SOAR obtains comparable accuracy to prior self-supervised learning (SSL) methods while requiring 87.5% less pretraining time and 25% less memory usage.

Paper Structure (13 sections, 2 equations, 5 figures, 3 tables)

This paper contains 13 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Typical UAV Video Datasets. Example frames from two UAV video datasets: UAV-Human (top) and NEC-Drone (bottom).
  • Figure 2: Overview of SOAR. SOAR uses an asymmetric encoder-decoder architecture to mask random video patches and reconstruct the missing ones, while leveraging object information to optimize the reconstruction. It takes both video frames and object detections as input. It first generates a center heatmap for each frame using 2D Gaussians for each bounding box. These heatmaps are then temporally stacked, and pixel values within patches are summed to create an objectness score map. This map serves a dual purpose: guiding the object-aware masking strategy to ensure balanced patch masking and contributing to the object-aware loss function to reweigh the reconstruction loss.
  • Figure 3: Our Object-Aware Masking Strategy. We first render the patch-level objectness score map from the center heatmap, then sort all the patches by their objectness scores. The sorted patches are divided into segments of equal length. Within each segment, one patch is randomly chosen to remain unmasked, while the remaining patches are masked. Finally, the generated mask is replicated across the temporal dimension to avoid information leakage.
  • Figure 4: Top-1 Accuracy under Different Mask Ratios. Contrary to findings from previous studies, a mask ratio of around 70% yields the best accuracy for UAV data.
  • Figure 5: Time Efficiency Comparison. SOAR converges much faster during pretraining and reaches comparable results with 87.5% fewer pretraining epochs (81.7% accuracy after 50 epochs vs. 81.2% after 400 epochs).
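
The pipeline described in the captions of Figures 2 and 3 (center heatmap, patch-level objectness scores, segment-wise masking) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian spread tied to the box size, the max-combination of overlapping boxes, and the collapse of the temporal axis into a single spatial score map are all choices made for the example.

```python
import math
import random


def center_heatmap(height, width, boxes):
    """Render one frame's heatmap: a 2D Gaussian centered on each (x0, y0, x1, y1) box."""
    heat = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        # Assumption: the Gaussian spread scales with half the box size.
        sx, sy = max((x1 - x0) / 2, 1.0), max((y1 - y0) / 2, 1.0)
        for y in range(height):
            for x in range(width):
                g = math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
                # Assumption: overlapping boxes are combined with a max.
                heat[y][x] = max(heat[y][x], g)
    return heat


def objectness_scores(heatmaps, patch):
    """Stack per-frame heatmaps and sum the pixel values inside each spatial patch."""
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    rows, cols = height // patch, width // patch
    scores = [0.0] * (rows * cols)
    for heat in heatmaps:
        for y in range(rows * patch):
            for x in range(cols * patch):
                scores[(y // patch) * cols + x // patch] += heat[y][x]
    return scores


def object_aware_mask(scores, mask_ratio, rng):
    """Sort patches by score, split them into segments, keep one random patch per segment."""
    n = len(scores)
    n_keep = max(1, round(n * (1 - mask_ratio)))
    order = sorted(range(n), key=lambda i: scores[i])
    masked = [True] * n
    for s in range(n_keep):
        lo, hi = s * n // n_keep, (s + 1) * n // n_keep
        masked[order[rng.randrange(lo, hi)]] = False
    return masked


rng = random.Random(0)
# Two 8x8 frames, each with one hypothetical person detection.
frames = [center_heatmap(8, 8, [(1, 1, 4, 5)]) for _ in range(2)]
scores = objectness_scores(frames, patch=2)  # 4 x 4 = 16 spatial patches
mask = object_aware_mask(scores, mask_ratio=0.75, rng=rng)
# The same spatial mask is reused for every frame, as in Figure 3.
tube_mask = [mask[:] for _ in frames]
```

Because exactly one patch survives in each score-sorted segment, the visible set always spans the full range from background to object patches, which is the balance the Figure 2 caption refers to. With 16 patches and a 75% mask ratio, four patches stay visible, one per segment of four.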