FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing

Hariseetharam Gunduboina, Muhammad Haris Khan, Biplab Banerjee

TL;DR

FrogDogNet tackles domain generalization in remote sensing for CLIP-based prompt learning by selectively retaining invariant low-frequency visual features through a Fourier Filter Block and guiding prompt learning with self-attention. A novel Remote Sensing Prompt Alignment Loss aligns RS-specific prompts with learned representations, while a lightweight Meta-Net converts filtered embeddings into visual tokens that augment textual prompts. The approach yields state-of-the-art performance across base-to-new, cross-dataset, and single-source multi-target generalization on four RS datasets, using a 16-shot regime and ViT-B/16 backbone. Empirical results demonstrate the effectiveness of frequency-based invariant feature retention in improving RS generalization with competitive computational efficiency. The work provides a new perspective on integrating frequency-domain filtering with prompt learning for robust domain adaptation in RS and beyond, with code available at the provided repository.

Abstract

In recent years, large-scale vision-language models (VLMs) like CLIP have gained attention for their zero-shot inference using instructional text prompts. While these models excel in general computer vision, their potential for domain generalization in remote sensing (RS) remains underexplored. Existing approaches enhance prompt learning by generating visual prompt tokens but rely on full-image features, introducing noise and background artifacts that vary within a class, causing misclassification. To address this, we propose FrogDogNet, a novel prompt learning framework integrating Fourier frequency filtering and self-attention to improve RS scene classification and domain generalization. FrogDogNet selectively retains invariant low-frequency components while eliminating noise and irrelevant backgrounds, ensuring robust feature representation across domains. The model first extracts significant features via projection and self-attention, then applies frequency-based filtering to preserve essential structural information for prompt learning. Extensive experiments on four RS datasets and three domain generalization tasks show that FrogDogNet consistently outperforms state-of-the-art prompt learning methods, demonstrating superior adaptability across domain shifts. Our findings highlight the effectiveness of frequency-based invariant feature retention in generalization, paving the way for broader applications. Our code is available at https://github.com/HariseetharamG/FrogDogNet

Paper Structure

This paper contains 14 sections, 20 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Impact of Fourier Filtering on RS Image Analysis: (a) demonstrates that retaining $50\%$ of low-frequency components (LFCs) preserves structural details while reducing noise in the frequency magnitude spectrum. (b) presents a sensitivity analysis, showing that keeping 350 out of 512 LFCs of visual features $f_v(x)$ achieves the highest average generalization performance.
  • Figure 2: Overview of FrogDogNet, which comprises a text encoder ($f_t$), an image encoder ($f_v$), a projection network, self-attention, and a Fourier Filter Block (FFB) for refining visual features. The image encoder extracts features, which are processed through the projection network and self-attention with residual connections, while the FFB retains key low-frequency components. The refined features pass through a lightweight Meta-Net $\{h_m\}_{m=1}^{\mathcal{M}}$ to generate visual tokens $\{\upsilon_m\}_{m=1}^{\mathcal{M}}$, which are combined with learnable text tokens $\{c_m\}_{m=1}^{\mathcal{M}}$ and class embeddings before entering the text encoder. To align the learned text prompts with remote sensing (RS) prompts, we introduce the RS Prompt Alignment (RPA) loss. The model is trained with a multi-task objective that incorporates both the contrastive loss and the RPA loss.
  • Figure 3: The Fourier Filter Block (FFB) transforms image embeddings $X$ using the Fast Fourier Transform (FFT), retains the top $k$ frequencies, and applies the inverse FFT (IFFT) to map them back to the original embedding space.
  • Figure 4: t-SNE plots (van der Maaten and Hinton, 2008) depicting the image features extracted from the Meta-Net of CoCoOp and the FFB of FrogDogNet for the domain generalization (DG) task on the RSICDv2 dataset. The legends indicate the corresponding class labels.
  • Figure 5: Classification performance of FrogDogNet across context lengths ($\mathcal{M}$) for the B2N generalization task on PatternNet, compared to state-of-the-art methods, using the harmonic mean (HM) of base and new class accuracies.
  • ...and 3 more figures
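
The transform, retain, and invert steps described for the Fourier Filter Block in Figure 3 can be illustrated with a small sketch. This is not the authors' implementation: the function names `dft`, `idft`, and `low_pass_filter`, the `cutoff` parameter, and the toy 64-value signal are all hypothetical, and the rule used here to decide which components count as "low-frequency" (a symmetric cutoff on the frequency index) is an assumption, since the captions above do not say how the retained $k$ components are selected. A naive transform from the standard library stands in for a real FFT routine.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, returning complex values."""
    n = len(spectrum)
    return [
        sum(spectrum[f] * cmath.exp(2j * math.pi * f * t / n) for f in range(n)) / n
        for t in range(n)
    ]


def low_pass_filter(embedding, cutoff):
    """Zero every bin whose frequency index is >= cutoff, then invert.

    Assumption: for real input, bins f and n - f carry the same frequency,
    so they are kept or dropped together, which keeps the output real.
    """
    n = len(embedding)
    spectrum = dft(embedding)
    kept = [
        value if min(f, n - f) < cutoff else 0j
        for f, value in enumerate(spectrum)
    ]
    return [value.real for value in idft(kept)]


# Toy "embedding": a slow wave plus a fast, low-amplitude ripple.
n = 64
slow = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
ripple = [0.3 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)]
noisy = [s + r for s, r in zip(slow, ripple)]

# With cutoff=8 the ripple (frequency index 20) is removed, the slow wave kept.
filtered = low_pass_filter(noisy, cutoff=8)
max_error = max(abs(a - b) for a, b in zip(filtered, slow))
```

In this toy case `max_error` is at the level of floating-point round-off, because the ripple lies entirely above the cutoff. For real embeddings an FFT library would replace the naive transforms, and the number of retained components would be a tuned hyperparameter, as in the sensitivity analysis of Figure 1(b).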