RICO: Two Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection
Matthias Neuwirth-Trapp, Maarten Bieshaar, Danda Pani Paudel, Luc Van Gool
TL;DR
RICO introduces two realistic incremental learning (IL) benchmarks for object detection: D-RICO, with domain shifts and a fixed class set, and EC-RICO, with expanding classes and domains. Both are constructed from 14 diverse datasets to reflect practical variations in sensors, conditions, and labeling policies. Using a ViTDet-based detector and multiple baselines, the study shows that existing IL methods struggle to balance stability and plasticity across long task sequences; simple replay-based approaches mitigate forgetting well but still fall short of individual-task training performance. Distillation-based methods perform poorly due to weak teachers across heterogeneous tasks, and a single-model IL strategy like LDB can achieve stability at the cost of plasticity, underscoring the need for more expressive, task-aware architectures. The results highlight the critical role of plasticity in real-world IL for object detection and establish D-RICO and EC-RICO as challenging benchmarks to guide future research toward more robust, scalable continual perception in varied real-world environments.
Abstract
Incremental Learning (IL) trains models sequentially on new data without full retraining, offering privacy, efficiency, and scalability. IL must balance adaptability to new data with retention of old knowledge. However, evaluations often rely on synthetic, simplified benchmarks, obscuring real-world IL performance. To address this, we introduce two Realistic Incremental Object Detection Benchmarks (RICO): Domain RICO (D-RICO) features domain shifts with a fixed class set, and Expanding-Classes RICO (EC-RICO) integrates new domains and classes per IL step. Built from 14 diverse datasets covering real and synthetic domains, varying conditions (e.g., weather, time of day), camera sensors, perspectives, and labeling policies, both benchmarks capture challenges absent in existing evaluations. Our experiments show that all IL methods underperform in adaptability and retention, while replaying a small amount of previous data already outperforms all methods. However, individual training on the data remains superior. We heuristically attribute this gap to weak teachers in distillation, single models' inability to manage diverse tasks, and insufficient plasticity. Our code will be made publicly available.
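To make "replaying a small amount of previous data" concrete, the following is a minimal, generic sketch of rehearsal-style training data mixing. It is not the authors' implementation: the `ReplayBuffer` class, the `mixed_batches` helper, the reservoir-sampling policy, and the `capacity` and `replay_fraction` parameters are all illustrative assumptions, and samples are stand-in tuples rather than images with box annotations.

```python
import random


class ReplayBuffer:
    """Hypothetical fixed-size memory of past samples (reservoir sampling)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0  # number of samples offered to the buffer so far
        self.rng = random.Random(seed)

    def add(self, sample):
        # Reservoir sampling: every sample seen so far is kept with
        # equal probability capacity / seen.
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = sample

    def sample(self, k):
        # Draw up to k stored samples without replacement.
        k = min(k, len(self.samples))
        return self.rng.sample(self.samples, k)


def mixed_batches(new_task_data, buffer, batch_size, replay_fraction):
    """Yield batches that combine new-task samples with replayed old ones.

    replay_fraction is an assumed hyperparameter: the share of each batch
    drawn from the buffer. It must leave room for at least one new sample.
    """
    n_replay = int(batch_size * replay_fraction)
    n_new = batch_size - n_replay
    if n_new <= 0:
        raise ValueError("replay_fraction must leave room for new samples")
    for start in range(0, len(new_task_data), n_new):
        new_part = new_task_data[start:start + n_new]
        yield new_part + buffer.sample(n_replay)


# Toy usage: task 0 fills the buffer, task 1 is trained with replay.
buffer = ReplayBuffer(capacity=4)
for i in range(10):
    buffer.add(("task0", i))

task1 = [("task1", i) for i in range(6)]
batches = list(mixed_batches(task1, buffer, batch_size=4, replay_fraction=0.5))
```

In this toy setup each batch holds two new-task samples followed by two replayed ones. A real detector would additionally need the stored annotations to match the label policy of the current step, which this sketch does not model.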
