Zero-shot Generalizable Incremental Learning for Vision-Language Object Detection

Jieren Deng; Haojian Zhang; Kun Ding; Jianhua Hu; Xingxuan Zhang; Yunkuan Wang

TL;DR

This paper presents Zero-interference Reparameterizable Adaptation (ZiRa), a method that combines a Zero-interference Loss with reparameterization techniques to tackle Incremental Vision-Language Object Detection (IVLOD) without additional inference cost or a significant increase in memory usage.

Abstract

This paper presents Incremental Vision-Language Object Detection (IVLOD), a novel learning task designed to incrementally adapt pre-trained Vision-Language Object Detection Models (VLODMs) to various specialized domains while preserving their zero-shot generalization capabilities for the generalized domain. To address this new challenge, we present Zero-interference Reparameterizable Adaptation (ZiRa), a novel method that introduces a Zero-interference Loss and reparameterization techniques to tackle IVLOD without incurring additional inference costs or a significant increase in memory usage. Comprehensive experiments on the COCO and ODinW-13 datasets demonstrate that ZiRa effectively safeguards the zero-shot generalization ability of VLODMs while continuously adapting to new tasks. Specifically, after training on the ODinW-13 datasets, ZiRa outperforms CL-DETR and iDETR, improving zero-shot generalizability by 13.91 and 8.74 AP, respectively. Our code is available at https://github.com/JarintotionDin/ZiRaGroundingDINO.
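
The two ingredients named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single linear layer with one parallel linear branch, and the names `forward_with_branch`, `merge_branch`, and `l1_output_penalty` are hypothetical. It shows (1) why a parallel linear branch can be folded into another set of weights, so the merged layer costs no more at inference than the original, and (2) an L1 penalty on the branch output, which is one plausible reading of a loss that keeps the added branch from disturbing the pre-trained model (the Figure 5 caption refers to the $L_1$ norm of the branch output). The exact definitions are given in the paper's Methodology section.

```python
# Illustrative sketch only; names, shapes, and the loss form are assumptions,
# not the ZiRa code.

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def forward_with_branch(base_w, branch_w, x):
    """Training-time form: base layer output plus a parallel branch output."""
    base_out = matvec(base_w, x)
    branch_out = matvec(branch_w, x)
    summed = [b + r for b, r in zip(base_out, branch_out)]
    return summed, branch_out

def merge_branch(base_w, branch_w):
    """Fold the linear branch into the base weights (reparameterization)."""
    return [[b + r for b, r in zip(brow, rrow)]
            for brow, rrow in zip(base_w, branch_w)]

def l1_output_penalty(branch_out):
    """Mean absolute value of the branch output (hypothetical loss term)."""
    return sum(abs(v) for v in branch_out) / len(branch_out)

base_w = [[1.0, 0.0], [0.5, -1.0]]
branch_w = [[0.25, 0.0], [0.0, 0.5]]
x = [2.0, 4.0]

two_branch_out, branch_out = forward_with_branch(base_w, branch_w, x)
merged_out = matvec(merge_branch(base_w, branch_w), x)
penalty = l1_output_penalty(branch_out)
```

Because both paths are linear, the merged layer reproduces the two-branch output with a single matrix multiply, which is the sense in which reparameterization avoids extra inference cost. Driving the penalty toward zero keeps the branch output small on inputs where the pre-trained behavior should be preserved.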

Paper Structure (11 sections, 8 equations, 6 figures, 9 tables)

Introduction
Related Work
Methodology
Overview
Reparameterizable Dual Branch
Zero-interference Loss
Experiments
Setup
Comparison with Existing Methods
Ablation Study
Conclusion

Figures (6)

Figure 1: Incremental Vision-Language Object Detection (IVLOD) aims to enhance VLODMs' performance across specialized domains via incremental learning, while also preserving their zero-shot generalization capability, enabling them to handle both known and unknown objects simultaneously and effectively.
Figure 2: Our framework features two Reparameterizable Dual Branches (RDBs) with Zero-interference Loss, one on the vision side and one on the language side.
Figure 3: The structure of the Reparameterizable Dual Branch (RDB).
Figure 4: The performance of the pre-trained VLODM with different levels of Gaussian noise added to the input of VLODM's detector.
Figure 5: The average $L_1$ norm of the RDB's output over all sequentially learned downstream tasks, computed on both the language and vision sides. The vertical axis is logarithmically scaled for better visualization.
...and 1 more figure
