FOUND: Fourier-based von Mises Distribution for Robust Single Domain Generalization in Object Detection

Mengzhu Wang, Changyuan Deng, Shanshan Wang, Nan Yin, Long Lan, Liang Yang

TL;DR

This work tackles single-domain generalization for object detection by integrating frequency-domain perturbations with hyperspherical feature regularization in a CLIP-guided framework. It introduces Probabilistic Fourier Augmentation (PFA) to diversify appearance while preserving semantic structure, and von Mises-Fisher (vMF) regularization to maintain semantically coherent, compact feature spaces. The method leverages CLIP-based target semantics to guide domain shifts via a semantic shift vector $\Delta q$, and optimizes a combined loss $\mathcal{L}_{total} = \mathcal{L}_{det} + \lambda_{vMF} \mathcal{L}_{vMF}$ to balance robustness and discriminability. Experiments on a challenging adverse-weather driving benchmark show state-of-the-art cross-domain generalization, with notable gains in night/rainy and dusk/rainy conditions, validating the synergy between frequency-domain augmentation and hypersphere-regularized representations for robust SDG in object detection.
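The combined objective above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the exact form of $\mathcal{L}_{vMF}$, so the sketch assumes a shared concentration $\kappa$ across classes, in which case the vMF normalizing constant cancels and the class posterior reduces to a softmax over $\kappa \mu_c^\top z$. The names `vmf_regularizer`, `total_loss`, `mean_dirs`, `kappa`, and `lambda_vmf`, and their default values, are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def vmf_regularizer(features, labels, mean_dirs, kappa=16.0):
    """Cross-entropy over vMF-style logits kappa * mu_c^T z.

    Assumption: every class shares the same concentration kappa, so the
    vMF normalizer cancels in the posterior and is not computed here.
    features:  (N, D) region features
    labels:    (N,)   integer class ids
    mean_dirs: (C, D) per-class mean directions
    """
    z = l2_normalize(features)
    mu = l2_normalize(mean_dirs)
    logits = kappa * z @ mu.T                          # (N, C)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

def total_loss(det_loss, vmf_loss, lambda_vmf=0.1):
    """L_total = L_det + lambda_vMF * L_vMF (lambda value is illustrative)."""
    return det_loss + lambda_vmf * vmf_loss
```

Features that point along their class mean direction incur a small penalty, while features aligned with another class's direction are penalized roughly in proportion to `kappa`, which is what encourages compact, well-separated clusters on the hypersphere.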

Abstract

Single Domain Generalization (SDG) for object detection aims to train a model on a single source domain that can generalize effectively to unseen target domains. While recent methods like CLIP-based semantic augmentation have shown promise, they often overlook the underlying structure of feature distributions and frequency-domain characteristics that are critical for robustness. In this paper, we propose a novel framework that enhances SDG object detection by integrating the von Mises-Fisher (vMF) distribution and Fourier transformation into a CLIP-guided pipeline. Specifically, we model the directional features of object representations using vMF to better capture domain-invariant semantic structures in the embedding space. Additionally, we introduce a Fourier-based augmentation strategy that perturbs amplitude and phase components to simulate domain shifts in the frequency domain, further improving feature robustness. Our method not only preserves the semantic alignment benefits of CLIP but also enriches feature diversity and structural consistency across domains. Extensive experiments on the diverse weather-driving benchmark demonstrate that our approach outperforms the existing state-of-the-art method.
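The Fourier-based augmentation described in the abstract can be illustrated as follows. This is a rough sketch, assuming Gaussian multiplicative noise on the amplitude and Gaussian additive noise on the phase; the abstract does not specify the sampling scheme used in the paper, and the function name `fourier_perturb` and the parameters `amp_scale` and `phase_scale` are hypothetical.

```python
import numpy as np

def fourier_perturb(image, amp_scale=0.2, phase_scale=0.05, rng=None):
    """Perturb the amplitude and phase spectra of an H x W x C float image.

    Assumption: noise is Gaussian and applied independently per frequency;
    the noise scales are illustrative, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng

    # 2D FFT over the spatial axes, computed separately for each channel.
    spec = np.fft.fft2(image, axes=(0, 1))
    amp, phase = np.abs(spec), np.angle(spec)

    # Amplitude mostly carries appearance/style: scale it by noise around 1.
    amp_noise = 1.0 + amp_scale * rng.standard_normal(amp.shape)
    amp = amp * np.clip(amp_noise, 0.0, None)

    # Phase mostly carries structure: keep its perturbation small.
    phase = phase + phase_scale * rng.standard_normal(phase.shape)

    # Recombine and invert. The perturbed spectrum is no longer exactly
    # Hermitian-symmetric, so the small imaginary residue is discarded.
    out = np.fft.ifft2(amp * np.exp(1j * phase), axes=(0, 1))
    return np.real(out)
```

Keeping `phase_scale` much smaller than `amp_scale` reflects the common observation that phase encodes object structure, so a light phase perturbation changes appearance without destroying the semantic layout that the detector relies on.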

Paper Structure

This paper contains 14 sections, 5 equations, 3 figures, 6 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overall framework of the proposed FOUND. The PFA module facilitates the learning of domain-invariant features, while the vMF module maintains semantic consistency and preserves diverse visual styles.
  • Figure 2: Visualization of the feature spaces for (a) the CLIP the Gap baseline and (b) FOUND. Compared to the entangled feature distribution of the baseline, FOUND learns a more separable and semantically aligned representation.
  • Figure 3: The baseline's feature map (b) is diffuse and distracted by background noise. In contrast, our FOUND model (c) generates a sharp activation map tightly focused on the target object.