Revisiting Cross-Attention Mechanisms: Leveraging Beneficial Noise for Domain-Adaptive Learning

Zelin Zang, Yehui Yang, Fei Wang, Liangyu Li, Baigui Sun

Abstract

Unsupervised Domain Adaptation (UDA) seeks to transfer knowledge from a labeled source domain to an unlabeled target domain but often suffers from severe domain and scale gaps that degrade performance. Existing cross-attention-based transformers can align features across domains, yet they struggle to preserve content semantics under large appearance and scale variations. To explicitly address these challenges, we introduce the concept of beneficial noise, which regularizes cross-attention by injecting controlled perturbations, encouraging the model to ignore style distractions and focus on content. We propose the Domain-Adaptive Cross-Scale Matching (DACSM) framework, which consists of a Domain-Adaptive Transformer (DAT) for disentangling domain-shared content from domain-specific style, and a Cross-Scale Matching (CSM) module that adaptively aligns features across multiple resolutions. DAT incorporates beneficial noise into cross-attention, enabling progressive domain translation with enhanced robustness, yielding content-consistent and style-invariant representations. Meanwhile, CSM ensures semantic consistency under scale changes. Extensive experiments on VisDA-2017, Office-Home, and DomainNet demonstrate that DACSM achieves state-of-the-art performance, with up to +2.3% improvement over CDTrans on VisDA-2017. Notably, DACSM achieves a +5.9% gain on the challenging "truck" class of VisDA, evidencing the strength of beneficial noise in handling scale discrepancies. These results highlight the effectiveness of combining domain translation, beneficial-noise-enhanced attention, and scale-aware alignment for robust cross-domain representation learning.

Paper Structure (23 sections, 7 theorems, 47 equations, 7 figures, 7 tables, 1 algorithm)

Key Result

Lemma 1

Given normalized features, the attention logits $\langle \Phi_i(C), \Phi_j(S) \rangle$ are equivalent to cosine similarities.
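Lemma 1 can be checked numerically. The sketch below assumes that $\Phi_i(C)$ and $\Phi_j(S)$ denote L2-normalized patch feature vectors; the helper names and the sample values are hypothetical and chosen only for illustration. It compares the inner product of the normalized vectors with the cosine similarity of the raw vectors.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine of the angle between two raw (unnormalized) vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical patch features (illustrative values, not taken from the paper).
content_patch = [3.0, 4.0, 0.0]
style_patch = [1.0, 2.0, 2.0]

# Attention logit computed on normalized features, as in the lemma.
logit = dot(l2_normalize(content_patch), l2_normalize(style_patch))

# Cosine similarity computed directly on the raw features.
cosine = cosine_similarity(content_patch, style_patch)

print(abs(logit - cosine) < 1e-12)
```

Because both vectors have unit norm after normalization, the denominator of the cosine formula equals one, so the logit and the cosine similarity coincide up to floating-point error.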

Figures (7)

  • Figure 1: The style/appearance gap and the scale gap are two frequently observed challenges in UDA. Two modules, the Domain-Adaptive Transformer and Cross-Scale Matching, are designed to address these two issues. (a) Style/appearance gap: two objects share the same structure but differ in appearance. (b) Scale gap: the same semantic category may appear at different scales in different datasets, which can confuse the model. This gap is frequently observed in many datasets but has received little attention.
  • Figure 2: Transformer blocks with self- and cross-attention. Each attention module takes three inputs: the input feature $\bm{z}$, the query $\bm{Q}$, and the key/value pairs $\bm{K}/\bm{V}$. In self-attention, $\bm{Q}$ and $\bm{K}/\bm{V}$ come from the same domain. In cross-attention, $\bm{Q}$ and $\bm{K}/\bm{V}$ may come from different domains.
  • Figure 3: DACSM Framework. Domain-Adaptive Cross-Scale Matching (DACSM) integrates the Domain-Adaptive Transformer (DAT) and the Cross-Scale Matching (CSM) modules into a single end-to-end architecture. Note that the Tokenizer (Patch Embedding) is applied only once at the input stage; subsequent blocks operate on feature tokens. The DAT employs self- and cross-attention to disentangle domain-shared content (queries) from domain-specific style (keys/values), ensuring query-consistent and content-biased feature learning. The residual connection adds the Source Query stream to the attention output, ensuring dimension matching. Beneficial noise is injected into cross-attention to suppress spurious style correlations and enhance robustness. The CSM module addresses scale discrepancies by rescaling source images with multiple factors and aligning target features through a sub-center classifier. Each sub-center learns scale-specific class features in the source domain, while the target features are adaptively matched to the most compatible scale.
  • Figure 4: (Top) Illustration of a style swap operation. A 2D convolution extracts $3\times3$ patches with stride 1 and computes the normalized cross-correlations. There are $n_c=9$ spatial locations and $n_s=4$ feature channels immediately before and after the channel-wise argmax operation. A 2D transposed convolution reconstructs the full activations by placing each best-matching style patch at the corresponding spatial location. (Bottom) Style transfer results of StyleSwap and optimization-based methods. Both figures are borrowed from StyleSwap [chen2016fast].
  • Figure 5: t-SNE scatter visualization of CDTrans and DACSM features on the VisDA-2017 dataset, showing the effect of domain alignment. CDTrans struggles to align the 'red' cluster, resulting in noisy boundaries, whereas DACSM aligns features from the source and target domains more effectively.
  • ...and 2 more figures
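
The distinction drawn in the Figure 2 caption between self-attention and cross-attention can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses the standard single-head scaled dot-product form, omits the learned $\bm{Q}$/$\bm{K}$/$\bm{V}$ projections, and does not model the beneficial-noise injection, whose exact form is not given in this summary. The token values and the names `source_tokens` and `target_tokens` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of token vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # One logit per key: scaled inner product between query and key.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(logits)
        # Output token is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# Hypothetical token features for two domains (illustrative values only).
source_tokens = [[1.0, 0.0], [0.0, 1.0]]
target_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

# Self-attention: Q and K/V come from the same domain.
self_out = attention(source_tokens, source_tokens, source_tokens)

# Cross-attention: Q comes from one domain, K/V from the other.
cross_out = attention(source_tokens, target_tokens, target_tokens)

print(len(self_out), len(cross_out))
```

In both cases the output has one token per query, so the cross-attention output keeps the shape of the query stream even when the other domain supplies a different number of key/value tokens.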

Theorems & Definitions (18)

  • Definition 1: Normalized Patch Features
  • Definition 2: StyleSwap Matching [chen2016fast]
  • Definition 3: Cross-Attention
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1
  • proof
  • Definition 4: Feature Encoding Stability
  • ...and 8 more