PAGen: Phase-guided Amplitude Generation for Domain-adaptive Object Detection

Shuchen Du, Shuo Lei, Feiran Li, Jiacheng Li, Daisuke Iso

TL;DR

This work tackles unsupervised domain adaptation for object detection by introducing PAGen, a lightweight phase-guided amplitude generation module that operates in the frequency domain as a training-time preprocessing step. PAGen preserves content while transferring target-domain styles to source images and is discarded at inference, ensuring zero overhead during deployment. It couples this spectral transfer with a feature alignment loss to learn more robust, domain-invariant features, and demonstrates consistent gains across benchmarks such as Cityscapes→Foggy Cityscapes, Sim10k→Cityscapes, BDD Night, and Cityscapes→ACDC. The results highlight the practicality of a simple, learnable, input-level adaptation approach that avoids multi-stage training or auxiliary networks while delivering strong performance improvements.

Abstract

Unsupervised domain adaptation (UDA) greatly facilitates the deployment of neural networks across diverse environments. However, most state-of-the-art approaches are overly complex, relying on challenging adversarial training strategies, or on elaborate architectural designs with auxiliary models for feature distillation and pseudo-label generation. In this work, we present a simple yet effective UDA method that learns to adapt image styles in the frequency domain to reduce the discrepancy between source and target domains. The proposed approach introduces only a lightweight pre-processing module during training and entirely discards it at inference time, thus incurring no additional computational overhead. We validate our method on domain-adaptive object detection (DAOD) tasks, where ground-truth annotations are easily accessible in source domains (e.g., normal-weather or synthetic conditions) but challenging to obtain in target domains (e.g., adverse weather or low-light scenes). Extensive experiments demonstrate that our method achieves substantial performance gains on multiple benchmarks, highlighting its practicality and effectiveness.

Paper Structure

This paper contains 24 sections, 14 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: (a) Recent state-of-the-art approaches mainly rely on self-training strategies to generate pseudo labels for target-domain images. Their training procedures are typically complex, often requiring multi-stage optimization and careful balancing among multiple loss terms. In contrast, our method enables a simple end-to-end one-step training pipeline, which greatly facilitates practical deployment. (b) In terms of target-domain data quality, existing methods heavily rely on pseudo labels for consistency training, which can introduce noise due to missed detections. Our approach, however, ensures accurate target supervision without such noise, thus reducing misleading gradients during adaptation.
  • Figure 2: Image decomposition in the frequency domain allows separating content and style components for cross-image recombination.
  • Figure 3: (a) Overview of our approach. The proposed PAGen is employed only during training as a preprocessing component for the detector. Given a source image and its style-altered PAGen output, we feed both into the detector and enforce feature alignment within the backbone’s latent space. This alignment is optimized jointly with the detector’s standard training objectives to learn the full pipeline. At inference time, PAGen is entirely discarded, and target-domain images are fed directly into the detector. (b) Architecture of PAGen. PAGen follows a patch-wise cross-attention paradigm. We transform the source image $I_s$ and the target image $I_t$ into the frequency domain, use the source phase as the query branch, and construct key–value features by concatenating amplitude representations from both domains along the spatial dimension. Cross-attention yields an adapted amplitude, which is combined with the source phase and converted back via iDFT to produce the PAGen-adapted image.
  • Figure 4: Examples of detection results on each weather split of the ACDC dataset.
  • Figure 5: t-SNE visualization of embedding features from ACDC (Snow) and Cityscapes, illustrating their domain discrepancy and the representational structure learned by the model.
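
The amplitude/phase decomposition described in the Figure 2 caption can be sketched with NumPy's FFT routines. This is only an illustration under stated assumptions: it uses a fixed, non-learned amplitude swap between two images, whereas PAGen, per the Figure 3(b) caption, generates the adapted amplitude with a learned patch-wise cross-attention module. The function names and the random stand-in images are hypothetical and do not come from the paper.

```python
import numpy as np


def split_amplitude_phase(image):
    """Return (amplitude, phase) of the 2-D DFT over the spatial axes of an HxWxC image."""
    spectrum = np.fft.fft2(image, axes=(0, 1))
    return np.abs(spectrum), np.angle(spectrum)


def recombine(amplitude, phase):
    """Rebuild an image from an amplitude and a phase spectrum via the inverse DFT."""
    spectrum = amplitude * np.exp(1j * phase)
    # For spectra built from real images the imaginary part is only rounding error.
    return np.fft.ifft2(spectrum, axes=(0, 1)).real


def swap_amplitude(source, target):
    """Keep the source phase (content) and borrow the target amplitude (style).

    Assumption: a plain swap stands in for PAGen's learned amplitude generation.
    """
    _, phase_s = split_amplitude_phase(source)
    amp_t, _ = split_amplitude_phase(target)
    return recombine(amp_t, phase_s)


# Hypothetical stand-ins for a source and a target image of equal size.
rng = np.random.default_rng(0)
source = rng.random((32, 32, 3))
target = rng.random((32, 32, 3))

roundtrip = recombine(*split_amplitude_phase(source))  # should reproduce the source
stylized = swap_amplitude(source, target)              # source content, target amplitude
```

In this sketch both images must share the same spatial size so their spectra align element-wise. A learned module such as the one in Figure 3(b) would replace `swap_amplitude` with a network that predicts the adapted amplitude from the source phase and both amplitudes.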