Long-Tailed Object Detection Pre-training: Dynamic Rebalancing Contrastive Learning with Dual Reconstruction

Chen-Long Duan, Yong Li, Xiu-Shen Wei, Lin Zhao

TL;DR

2DRCL (Dynamic Rebalancing Contrastive Learning with Dual Reconstruction) is a novel pre-training framework for object detection. It builds on a Holistic-Local Contrastive Learning mechanism that aligns pre-training with object detection by capturing both global contextual semantics and detailed local patterns.

Abstract

Pre-training plays a vital role in various vision tasks, such as object recognition and detection. Commonly used pre-training methods, which typically rely on randomized approaches like uniform or Gaussian distributions to initialize model parameters, often fall short when confronted with long-tailed distributions, especially in detection tasks. This is largely due to extreme data imbalance and the issue of simplicity bias. In this paper, we introduce a novel pre-training framework for object detection, called Dynamic Rebalancing Contrastive Learning with Dual Reconstruction (2DRCL). Our method builds on a Holistic-Local Contrastive Learning mechanism, which aligns pre-training with object detection by capturing both global contextual semantics and detailed local patterns. To tackle the imbalance inherent in long-tailed data, we design a dynamic rebalancing strategy that adjusts the sampling of underrepresented instances throughout the pre-training process, ensuring better representation of tail classes. Moreover, Dual Reconstruction addresses simplicity bias by enforcing a reconstruction task aligned with the self-consistency principle, specifically benefiting underrepresented tail classes. Experiments on COCO and LVIS v1.0 datasets demonstrate the effectiveness of our method, particularly in improving the mAP/AP scores for tail classes.
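
The abstract does not give the rebalancing rule itself, so the following is only a minimal sketch of what "adjusting the sampling of underrepresented instances throughout pre-training" could look like. It assumes a hypothetical schedule in which the probability of drawing a class is proportional to `count ** (1 - power)`, with `power` ramped linearly from 0 to 1 as training progresses. The function names, the linear ramp, and the example class counts are all illustrative assumptions, not the authors' method.

```python
import random


def class_sampling_probs(class_counts, progress, max_power=1.0):
    """Return per-class sampling probabilities for a given training progress.

    Hypothetical schedule (an assumption, not taken from the paper):
    probability is proportional to count ** (1 - power), where power grows
    linearly from 0 to max_power as progress goes from 0 to 1.
    power = 0 reproduces the raw long-tailed distribution;
    power = 1 samples every class equally often.
    """
    clipped = min(max(progress, 0.0), 1.0)
    power = max_power * clipped
    raw = {c: n ** (1.0 - power) for c, n in class_counts.items()}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}


def draw_classes(class_counts, progress, k, rng):
    """Draw k class labels using the probabilities for this progress value."""
    probs = class_sampling_probs(class_counts, progress)
    classes = list(probs)
    weights = [probs[c] for c in classes]
    return rng.choices(classes, weights=weights, k=k)


if __name__ == "__main__":
    # Made-up counts for one head, one common, and one tail class.
    counts = {"person": 9000, "chair": 900, "unicycle": 100}
    rng = random.Random(0)
    for progress in (0.0, 0.5, 1.0):
        probs = class_sampling_probs(counts, progress)
        batch = draw_classes(counts, progress, k=8, rng=rng)
        print(progress, {c: round(p, 3) for c, p in probs.items()}, batch)
```

Under this assumed schedule the tail class is drawn rarely at the start and as often as the head class by the end; a real implementation would more likely drive the schedule from training statistics than from a fixed ramp.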

Paper Structure

This paper contains 39 sections, 8 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Illustration of the proposed Dynamic Rebalancing Contrastive Learning with Dual Reconstruction (2DRCL) method, which consists of Holistic Contrastive Learning, Local Contrastive Learning, and Dual Reconstruction. The whole network can be trained end to end.
  • Figure 2: Error analysis comparison. 2DRCL achieves superior performance on tail classes without significantly compromising accuracy on the more frequent classes.
  • Figure 3: Attention map comparison among the baseline [lvis], ECM [ecm], 2DRCL (w/o DRC), and 2DRCL (our method) on the LVIS dataset. The top row shows the class names of the input images. Best viewed in color.
  • Figure A.1: Visualizations of detection results before (left of each group) and after (right) applying our 2DRCL. We adopt RFS [lvis] as the baseline on LVIS and combine it with our 2DRCL pre-training method. Compared with the baseline, the proposed method detects objects that were previously missed and rectifies bounding box predictions. Best viewed in color.
  • Figure A.2: (a) and (b) show the distribution of classifier weight norms across classes in Mask R-CNN models trained on the LVIS v1.0 training split [lvis]. The x-axis is the category index sorted by category frequency; the y-axis is the weight norm. Transparent lines show the raw weight norm of each category, and solid lines are polynomial curves fitted to those values to show the smoothed trend across classes. (a) compares against state-of-the-art methods, while (b) compares the components proposed in this paper. (c) shows the average classifier weight norm for each frequency category.