Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection

Yongwei Jiang, Yixiong Zou, Yuhua Li, Ruixuan Li

Abstract

Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, a setting that suffers from severe domain shift and data scarcity. In this work, we identify a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, much like a human eye that cannot focus on visual objects. We therefore call it the target-domain Astigmatism problem. Analysis of attention distances across transformer layers reveals that regular fine-tuning inherently tends to remedy this problem, but the results remain far from satisfactory, and we aim to strengthen this trend in this paper. Inspired by the human fovea-style visual system, we enhance fine-tuning's inherent trend through a center-periphery attention refinement framework comprising (1) a Positive Pattern Refinement module that reshapes attention toward semantic objects using class-specific prototypes, simulating the visual center region; (2) a Negative Context Modulation module that enhances boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module that strengthens the center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.

Paper Structure (27 sections, 15 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Visualization and quantification of the target-domain Astigmatism problem. Top: Attention maps measured across transformer blocks show that, in source domains (first row), attention progressively focuses on foreground objects, while in target domains (second row), attention remains persistently dispersed, resulting in oversized boxes and redundant predictions in object detection. Bottom: Attention distance across network depth (SxBy denotes Stage x, Block y in the Swin Transformer) reveals: (1) a rise-then-fall trend in attention distance, reflecting initially broad attention that gradually concentrates on objects for precise localization; (2) consistently higher attention dispersion in target domains than in source domains; and (3) only a marginal reduction of this attention dispersion under regular fine-tuning.
  • Figure 2: The inspiration from human fovea-style vision to remedy the Astigmatism problem. Our method mimics the human visual system: the Core Perception Zone (green) with high-detail processing guides the Positive Pattern Refinement (PPR) module to reshape attention more effectively toward foreground objects, while the Peripheral Zone (orange) with reduced details informs the Negative Context Modulation (NCM) module to enhance object-background boundaries by modeling background contexts. The Textual Semantic Alignment (TSA) module enforces the distinctions between center and peripheral regions, analogous to the center-surround mechanism of biological perception.
  • Figure 3: Overview of our human-vision-inspired framework to remedy the Astigmatism problem in CD-FSOD. The architecture integrates three complementary modules: (1) Positive Pattern Refinement (PPR) reshapes attention toward foreground objects using class prototypes; (2) Negative Context Modulation (NCM) enhances object-background boundaries through explicit background modeling; and (3) Textual Semantic Alignment (TSA) enhances these distinctions via cross-modal knowledge integration with negative descriptors ("not [class]"). During training (top), the model is optimized with both detection and alignment objectives, and extracts discriminative prototypes from support examples, stored in positive and negative repositories. At inference (bottom), stored prototypes turn dispersed attention patterns in query images into crystallized object-centric representations, analogous to the human center-peripheral visual system.
  • Figure 4: Attention distribution around a foreground patch $x_0$. Dashed arrows labeled $A_k$ indicate attention from $x_0$ to neighbors $x_k$. Same‑object neighbors $x_2,x_3$ are spatially near and should receive high attention, whereas the background patch $x_1$ is distant and should receive weak attention. In target‑domain Astigmatism, domain shift diverts attention toward the distant background ($A_1\uparrow$; $A_2,A_3\downarrow$), yielding dispersed attention and a larger attention distance. Since the spatial distances between patches are fixed, our method reverses this dispersion by (i) down‑weighting the background response $A_1$ via a learned background prototype and simple "not [class]" cues (NCM/TSA modules) to suppress spurious foreground–background affinity, and (ii) up‑weighting the same‑object responses $A_2,A_3$ via class‑specific foreground prototypes (PPR module) to strengthen intra‑object compatibility, thereby shortening the attention distance and restoring a focused, object‑centric pattern.
  • Figure 5: Top: Visualization of attention maps demonstrating the Astigmatism problem and our solution. Each row displays two different samples, each showing the original image with ground truth (left), conventional fine-tuning, which only marginally alleviates dispersed attention patterns (middle), and our method, which yields focused attention that precisely concentrates on target regions (right). Bottom: Change in attention distance relative to the pretrained model across datasets. Negative values indicate a reduction in attention dispersion (larger magnitude means better focus). While fine-tuning reduces dispersion, our method achieves substantially greater reductions, validating its superior ability to address the Astigmatism problem by reshaping attention toward foreground objects.
  • ...and 4 more figures
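
The attention-distance argument in the Figure 4 caption can be illustrated with a small numerical sketch. The captions listed here do not give the paper's exact formula, so the code below assumes the common definition: the attention-weighted mean spatial distance from a query patch to all patches. The function name `attention_distance`, the patch coordinates, and the attention weights are all hypothetical values chosen for illustration, not taken from the paper.

```python
import numpy as np

def attention_distance(attn, coords, query_idx=0):
    """Attention-weighted mean spatial distance from one query patch.

    Assumed definition (not confirmed by the source): sum_k A_k * d(x_q, x_k),
    with the attention weights normalised to sum to 1.
    """
    attn = np.asarray(attn, dtype=float)
    coords = np.asarray(coords, dtype=float)
    weights = attn / attn.sum()
    # Euclidean distance from the query patch to every patch (fixed by layout).
    dists = np.linalg.norm(coords - coords[query_idx], axis=1)
    return float(weights @ dists)

# Hypothetical layout in patch units: x0 is the foreground query patch,
# x1 is a distant background patch, x2 and x3 are nearby same-object patches.
coords = [(0, 0), (6, 8), (1, 0), (0, 1)]   # distances from x0: 0, 10, 1, 1

# Attention from x0 to (x0, x1, x2, x3); the last three entries play the
# role of A_1, A_2, A_3 in the caption.
focused = [0.4, 0.0, 0.3, 0.3]     # weight stays on the object
dispersed = [0.2, 0.6, 0.1, 0.1]   # weight diverted to the background (A_1 up)

d_focused = attention_distance(focused, coords)
d_dispersed = attention_distance(dispersed, coords)
print(d_focused, d_dispersed)
```

Because the distances are fixed by the patch layout, the only way to shorten the attention distance in this toy setting is to move weight from the distant background patch back to the nearby same-object patches, which is the effect the caption attributes to the NCM/TSA and PPR modules.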