DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection
Chiara Cappellino, Gianluca Mancusi, Matteo Mosconi, Angelo Porrello, Simone Calderara, Rita Cucchiara
TL;DR
DitHub is a modular framework for Incremental Open-Vocabulary Object Detection, inspired by version control systems, that maintains a growing library of class-specific adaptation modules. Modules are managed through branch, fetch, and merge operations and are trained in two stages, a warmup phase followed by class-specific specialization; a shared $B$ matrix keeps the library memory-efficient. This design enables selective updates across tasks while preserving zero-shot capabilities. Empirically, DitHub achieves state-of-the-art results on the ODinW-13 and ODinW-O benchmarks. Ablations show substantial gains from modular specialization, strong performance even at low LoRA ranks, and support for targeted unlearning. The modular approach offers scalable, controllable adaptation for open-vocabulary detectors, with practical implications for cross-domain robustness and privacy-preserving model updates.
Abstract
Open-vocabulary object detectors can generalize to an unrestricted set of categories through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on multiple specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in object detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and on ODinW-O, a newly introduced benchmark designed to assess class reappearance. For more details, visit our project page: https://aimagelab.github.io/DitHub/
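
To make the branch, fetch, and merge vocabulary concrete, the following is a minimal toy sketch of a library of class-specific low-rank modules that share a single $B$ matrix. It is not the authors' implementation: the `ModuleLibrary` class, its method signatures, the random initialization, and the weighted-average merge rule are all illustrative assumptions, and the update is written in the usual LoRA form $\Delta W = BA$.

```python
import numpy as np


class ModuleLibrary:
    """Toy library of class-specific low-rank modules (hypothetical, for illustration)."""

    def __init__(self, d_out, d_in, rank, seed=0):
        self.rng = np.random.default_rng(seed)
        self.d_in = d_in
        self.rank = rank
        # Shared factor B (d_out x rank), reused by every class module.
        self.B = self.rng.normal(scale=0.01, size=(d_out, rank))
        # Class-specific factors A (rank x d_in), keyed by class name.
        self.A = {}

    def branch(self, cls, init=None):
        """Create (or overwrite) the module for `cls`, optionally copying `init`."""
        if init is None:
            init = self.rng.normal(scale=0.01, size=(self.rank, self.d_in))
        self.A[cls] = np.array(init, dtype=float, copy=True)
        return self.A[cls]

    def fetch(self, classes):
        """Return the stored A factors for the requested classes."""
        return [self.A[c] for c in classes]

    def merge(self, classes, weights=None):
        """Weighted average of the fetched A factors (assumed merge rule)."""
        mats = np.stack(self.fetch(classes))          # (n, rank, d_in)
        if weights is None:
            weights = np.full(len(classes), 1.0 / len(classes))
        weights = np.asarray(weights, dtype=float)    # (n,)
        return np.tensordot(weights, mats, axes=1)    # (rank, d_in)

    def delta_w(self, classes, weights=None):
        """Low-rank weight update B @ A_merged for the selected classes."""
        return self.B @ self.merge(classes, weights)


# Usage: two class modules, merged on demand for a prompt mentioning both.
lib = ModuleLibrary(d_out=4, d_in=6, rank=2)
lib.branch("cat", init=np.ones((2, 6)))
lib.branch("dog", init=3 * np.ones((2, 6)))
merged = lib.merge(["cat", "dog"])
update = lib.delta_w(["cat", "dog"])
```

Because only the small `A` factors are stored per class, the library grows by `rank * d_in` parameters for each new class in this sketch, while `B` is stored once.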
