Visually Similar Pair Alignment for Robust Cross-Domain Object Detection

Onkar Krishna, Hiroki Ohashi

TL;DR

This work tackles unsupervised domain adaptation for object detection by showing that aligning visually similar pairs across domains, rather than pairing arbitrary source and target instances, improves transfer. It introduces a memory-augmented framework with separate foreground and background memories that retrieve visually similar source features for alignment, coupled with a triplet-like foreground alignment loss and a background adversarial module. Across adverse-weather, synthetic-to-real, and real-to-artistic shifts, the approach achieves state-of-the-art results (e.g., 53.1 mAP on Foggy Cityscapes and 62.3 mAP on Sim10k). The work also provides a custom cross-domain dataset with controlled visual attributes and analyzes memory design choices, including memory subsampling to reduce cost.

Abstract

Domain gaps between training data (source) and real-world environments (target) often degrade the performance of object detection models. Most existing methods aim to bridge this gap by aligning features across source and target domains but often fail to account for visual differences, such as color or orientation, in alignment pairs. This limitation leads to less effective domain adaptation, as the model struggles to manage both domain-specific shifts (e.g., fog) and visual variations simultaneously. In this work, we demonstrate for the first time, using a custom-built dataset, that aligning visually similar pairs significantly improves domain adaptation. Based on this insight, we propose a novel memory-based system to enhance domain alignment. This system stores precomputed features of foreground objects and background areas from the source domain, which are periodically updated during training. By retrieving visually similar source features for alignment with target foreground and background features, the model effectively addresses domain-specific differences while reducing the impact of visual variations. Extensive experiments across diverse domain shift scenarios validate our method's effectiveness, achieving 53.1 mAP on Foggy Cityscapes and 62.3 on Sim10k, surpassing prior state-of-the-art methods by 1.2 and 4.1 mAP, respectively.
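The retrieval step described in the abstract (finding a visually similar source feature in memory for each target feature) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of cosine similarity, the function names `cosine_similarity` and `retrieve_most_similar`, and the toy three-dimensional features are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_most_similar(target_feature, memory):
    """Return (index, similarity) of the memory entry closest to the target.

    `memory` is a hypothetical list of stored source-domain feature vectors;
    the similarity measure (cosine) is an assumption, not taken from the paper.
    """
    if not memory:
        raise ValueError("memory must contain at least one feature")
    best_index, best_similarity = -1, -math.inf
    for i, source_feature in enumerate(memory):
        similarity = cosine_similarity(target_feature, source_feature)
        if similarity > best_similarity:
            best_index, best_similarity = i, similarity
    return best_index, best_similarity


# Toy source memory and one target feature (illustrative values only).
source_memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
target = [0.6, 0.8, 0.0]
index, similarity = retrieve_most_similar(target, source_memory)
```

In a setup like the one the abstract describes, the retrieved source feature would then be paired with the target feature for alignment, so that the remaining difference between the two is mostly the domain shift rather than color or orientation.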

Paper Structure

This paper contains 17 sections, 15 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: (a) To validate our hypothesis, we introduce a new cross-domain dataset, AugSim10k → FoggyAugSim10k. The source dataset is created by augmenting Sim10k, applying new visual attributes such as color and orientation exclusively to the labeled objects. The target dataset is generated by applying fixed-intensity fog to the augmented source images. (b) Detection accuracy on the target dataset is compared using models trained with different instance alignment schemes. The results demonstrate that aligning visually similar pairs, differing only in domain characteristics (e.g., fog), significantly outperforms alignment of pairs with variations in color or orientation.
  • Figure 2: Network overview. The network mainly consists of a memory module and visual-similarity-based foreground and background domain alignment modules.
  • Figure 3: Impact of the proposed memory-based alignment module on detection accuracy. By enabling visually similar matches across batches, the memory-based approach enhances domain alignment, achieving a 4.6% performance improvement over the best-performing non-memory-based C2C method.
  • Figure 4: Performance comparison of alignment strategies: foreground-only, background-only, and combined alignment. The results demonstrate that aligning both foreground and background yields the best performance.
  • Figure 5: Comparison of subsampling methods for memory banks ($\mathcal{M}_\text{bg}$ and $\mathcal{M}_\text{fg}$): greedy coreset selection versus random subsampling.
  • ...and 2 more figures
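The two subsampling strategies compared in Figure 5 can be contrasted with a small sketch. It assumes that "greedy coreset selection" refers to the common k-center greedy (farthest-point) heuristic with Euclidean distance; the source does not specify the variant, and the names `greedy_coreset` and `random_subsample` are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_coreset(features, k, start=0):
    """k-center greedy selection: repeatedly add the feature that is
    farthest from the already selected set (assumed variant)."""
    if not 0 < k <= len(features):
        raise ValueError("k must be between 1 and len(features)")
    selected = [start]
    # Distance from every feature to its nearest selected feature.
    min_dist = [euclidean(f, features[start]) for f in features]
    while len(selected) < k:
        farthest = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(farthest)
        for i, f in enumerate(features):
            d = euclidean(f, features[farthest])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


def random_subsample(features, k, seed=0):
    """Uniform random subsampling without replacement (baseline)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(features)), k))


# Toy memory bank with two tight clusters and one outlier (illustrative).
memory = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
coreset_indices = greedy_coreset(memory, k=3)
random_indices = random_subsample(memory, k=3)
```

On this toy bank the greedy selection picks one representative from each region of feature space, whereas random subsampling may draw several near-duplicates from the same cluster. Coverage of this kind is the usual motivation for coreset selection when shrinking a memory bank.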