Learning on the Fly: Replay-Based Continual Object Perception for Indoor Drones

Sebastian-Ion Nae, Mihai-Eugen Barbu, Sebastian Mocanu, Marius Leordeanu

TL;DR

This work introduces an indoor UAV video dataset with preserved temporal coherence and evaluates replay-based Class-Incremental Learning (CIL) under limited replay budgets. Gradient-weighted class activation mapping (Grad-CAM) analysis shows attention shifts across classes in mixed scenes, a pattern associated with reduced localization quality for drones.

Abstract

Autonomous agents such as indoor drones must learn new object classes in real time while limiting catastrophic forgetting, which motivates Class-Incremental Learning (CIL). However, most unmanned aerial vehicle (UAV) datasets focus on outdoor scenes and offer few temporally coherent indoor videos. We introduce an indoor dataset of $14{,}400$ frames capturing inter-drone and ground vehicle footage, annotated via a semi-automatic workflow with a $98.6\%$ first-pass labeling agreement before final manual verification. On this dataset, we benchmark three replay-based CIL strategies, Experience Replay (ER), Maximally Interfered Retrieval (MIR), and Forgetting-Aware Replay (FAR), with YOLOv11-nano as a resource-efficient detector for deployment-constrained UAV platforms. Under tight memory budgets ($5$-$10\%$ replay), FAR outperforms ER and MIR, achieving an average accuracy (ACC, $mAP_{50-95}$ across increments) of $82.96\%$ with $5\%$ replay. Gradient-weighted class activation mapping (Grad-CAM) analysis shows attention shifts across classes in mixed scenes, a pattern associated with reduced localization quality for drones. The experiments further demonstrate that replay-based continual learning can be applied effectively to edge aerial systems. Overall, this work contributes an indoor UAV video dataset with preserved temporal coherence and an evaluation of replay-based CIL under limited replay budgets. Project page: https://spacetime-vision-robotics-laboratory.github.io/learning-on-the-fly-cl
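To make the notion of a replay budget concrete, the following is a minimal sketch of uniform Experience-Replay-style sample selection. It assumes that a "$5\%$ replay" budget means keeping that share of earlier-task frames, which the abstract does not state explicitly. The function names `select_replay_frames` and `training_frames_for_task` and the frame identifiers are hypothetical and are not taken from the authors' code. MIR and FAR choose replay samples with different criteria and are not shown here.

```python
import random


def select_replay_frames(old_frames, replay_percent, seed=0):
    """Uniformly sample a fixed share of earlier-task frames (ER-style).

    Assumption: the budget is a percentage of the earlier-task frames.
    """
    if not 0 <= replay_percent <= 100:
        raise ValueError("replay_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding in the budget.
    budget = len(old_frames) * replay_percent // 100
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return rng.sample(list(old_frames), budget)


def training_frames_for_task(new_frames, old_frames, replay_percent, seed=0):
    """Frames for the current increment: all new frames plus the replayed subset."""
    return list(new_frames) + select_replay_frames(old_frames, replay_percent, seed)


if __name__ == "__main__":
    # Hypothetical frame identifiers, for illustration only.
    old = [f"task1_frame_{i:04d}" for i in range(1000)]
    new = [f"task2_frame_{i:04d}" for i in range(400)]
    mixed = training_frames_for_task(new, old, replay_percent=5)
    print(len(mixed))  # 400 new frames + 50 replayed frames = 450
```

Uniform sampling is the simplest selection policy. A stricter budget only lowers `replay_percent`, so the same sketch covers both the $5\%$ and $10\%$ settings.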

Paper Structure (11 sections, 6 figures, 2 tables)

This paper contains 11 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Example pseudo-label outcomes. Left/right: successful detections. Middle: a rare miss where GroundingSAM fails to localize a drone. Such cases occurred in approximately $1.4\%$ of frames and were corrected during human review.
  • Figure 2: Task-wise $\text{mAP}_{50\text{-}95}$ for box detection and instance segmentation at $10\%$ (top) and $5\%$ (bottom) replay. Naïve fine-tuning shows forgetting on earlier tasks, while replay-based methods improve retention. FAR and MIR remain closest to the joint-training upper bound under these settings.
  • Figure 3: Grad-CAM highlighting a human attention pattern in mixed scenes.
  • Figure 4: Grad-CAM showing distributed, target-focused attention in UAV-only scenes.
  • Figure 5: Grad-CAM visualizations of the final conv layer across five sequential tasks with a 5% replay buffer. Rows show ER, MIR, and FAR attention patterns for each task (warmer = higher activation). ER's attention degrades after Task 3 and becomes diffuse by Task 5, indicating poor retention. MIR and FAR maintain localized attention patterns across tasks, demonstrating plasticity and generalization. FAR's attention is more concentrated, especially in Tasks 3-4, suggesting more discriminative features than MIR's broader attention.
  • ...and 1 more figures