Fourier-Attentive Representation Learning: A Fourier-Guided Framework for Few-Shot Generalization in Vision-Language Models

Hieu Dinh Trung Pham, Huy Minh Nhat Nguyen, Cuong Tuan Nguyen

TL;DR

This work tackles the entanglement of structure and style in Vision-Language Model (VLM) adaptation by introducing FARL, a Fourier-guided framework that disentangles phase (structure) and amplitude (style) information. A dual cross-attention mechanism lets learned modality-agnostic representation tokens query phase and amplitude features, producing structure- and style-aware tokens. Injection is asymmetric: the enriched tokens are injected into the text encoder, while the original tokens are injected into the image encoder. Experiments on 15 datasets report improved base-to-novel generalization, cross-dataset transfer, and domain generalization, and qualitative analyses show the phase and amplitude streams attending to distinct cues. The study presents Fourier-domain disentanglement as a principled way to improve robustness and generalization in few-shot VLM adaptation.

Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have demonstrated strong few-shot learning capabilities. However, existing adaptation methods typically learn holistic representations where an image's domain-invariant structure is implicitly entangled with its domain-specific style. This presents an opportunity to further enhance generalization by disentangling these visual cues. In this paper, we propose Fourier-Attentive Representation Learning (FARL), a novel framework that addresses this by explicitly disentangling visual representations using Fourier analysis. The core of our method is a dual cross-attention mechanism, where learnable representation tokens separately query an image's structural features (from the phase spectrum) and stylistic features (from the amplitude spectrum). This process yields enriched, disentangled tokens that are then injected deep into the VLM encoders to guide adaptation. Our design, which includes an asymmetric injection strategy, forces the model to learn a more robust vision-language alignment. Extensive experiments on 15 datasets demonstrate the effectiveness of our approach.
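
The phase/amplitude split that the abstract relies on can be illustrated with a small, self-contained sketch. This is not the authors' code: the function names (`dft2`, `idft2`, `split_phase_amplitude`, `phase_only`, `amplitude_only`, `recombine`) are hypothetical, and a naive discrete Fourier transform from the Python standard library stands in for the FFT routine a real pipeline would use. How FARL turns the two spectra into features is not described in this summary, so the sketch stops at the decomposition itself.

```python
import cmath
import math


def dft2(x):
    """Naive 2D discrete Fourier transform of a list-of-lists grid."""
    h, w = len(x), len(x[0])
    return [[sum(x[m][n] * cmath.exp(-2j * math.pi * (u * m / h + v * n / w))
                 for m in range(h) for n in range(w))
             for v in range(w)]
            for u in range(h)]


def idft2(spec):
    """Naive inverse 2D DFT; returns a grid of complex numbers."""
    h, w = len(spec), len(spec[0])
    return [[sum(spec[u][v] * cmath.exp(2j * math.pi * (u * m / h + v * n / w))
                 for u in range(h) for v in range(w)) / (h * w)
             for n in range(w)]
            for m in range(h)]


def split_phase_amplitude(image):
    """Return (amplitude, phase) of the image's 2D spectrum."""
    spec = dft2(image)
    amplitude = [[abs(c) for c in row] for row in spec]
    phase = [[cmath.phase(c) for c in row] for row in spec]
    return amplitude, phase


def recombine(amplitude, phase):
    """Rebuild an image from amplitude and phase (real part only)."""
    spec = [[cmath.rect(a, p) for a, p in zip(a_row, p_row)]
            for a_row, p_row in zip(amplitude, phase)]
    return [[c.real for c in row] for row in idft2(spec)]


def phase_only(image):
    """Keep the phase, set every amplitude to 1 (structure cue)."""
    amplitude, phase = split_phase_amplitude(image)
    ones = [[1.0 for _ in row] for row in amplitude]
    return recombine(ones, phase)


def amplitude_only(image):
    """Keep the amplitude, set every phase to 0 (style cue)."""
    amplitude, phase = split_phase_amplitude(image)
    zeros = [[0.0 for _ in row] for row in phase]
    return recombine(amplitude, zeros)
```

Recombining the untouched amplitude and phase returns the original image up to floating-point error, which is a quick sanity check that the split loses no information. Note that `phase_only` assigns unit amplitude even to frequencies whose original amplitude was zero; that is a simplification of this sketch.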

Paper Structure

This paper contains 23 sections, 14 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Comparison of harmonic mean performance between previous methods (MMRL, MMA, TCP) and our proposed FARL across 11 diverse datasets for base-to-novel generalization.
  • Figure 2: Illustration of image reconstruction from the Phase and Amplitude of the Fourier transform. (a) The original image. (b) The phase-only image retains the high-level structural features. (c) The amplitude-only image retains low-level stylistic features.
  • Figure 3: Overview of the FARL architecture. An image is decomposed into phase (structure) and amplitude (style) components. The Fourier Fusion Attention module (Fig. 4) uses these disentangled features to enrich learnable representation tokens $R$. Following an asymmetric injection strategy, the fused tokens are injected into the Text Encoder, while the original $R$ tokens are injected into the Image Encoder. The model is optimized with a combination of cross-entropy $\mathcal{L}_{ce}$ and cosine regularization $\mathcal{L}_{cos}$ losses. Symbols: $c$: class token, $B/E$: text boundaries, $R$: representation tokens, $F$: projection layers.
  • Figure 4: The Fourier Fusion Attention module. The module uses the original representation tokens $R$ as Queries to attend to Phase and Amplitude Features as Keys/Values in parallel cross-attention blocks. The results are fused by an MLP and combined with the original $R$ via a residual connection to produce the final enriched tokens.
  • Figure 5: Visualization of Fourier decomposition and the dual-attention mechanism across diverse datasets. From left to right, each row displays: (1) the original image, (2) the phase-only reconstruction, (3) the attention map from the phase stream, (4) the amplitude-only reconstruction, and (5) the attention map from the amplitude stream.
  • ...and 2 more figures
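
The data flow described in the Figure 4 caption (tokens $R$ as queries, phase and amplitude features as keys/values in two parallel blocks, MLP fusion, residual connection) can be sketched in plain Python. This is a minimal illustration under several assumptions that the captions do not confirm: single-head attention with identity projections, fusion by concatenating the two attention outputs, and a one-layer ReLU map standing in for the MLP. The names `cross_attention`, `fourier_fusion_attention`, and `w_fuse` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: keys and values are the same features and no learned
    projections are applied, to keep the sketch short.
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, keys_values)


def fourier_fusion_attention(r_tokens, phase_feats, amp_feats, w_fuse):
    """R attends to phase and amplitude features in parallel.

    w_fuse has shape (2d, d); it is a hypothetical stand-in for the
    fusion MLP, applied to the concatenated attention outputs.
    """
    from_phase = cross_attention(r_tokens, phase_feats)
    from_amp = cross_attention(r_tokens, amp_feats)
    # Concatenate along the feature axis: each row becomes length 2d.
    concat = [p + a for p, a in zip(from_phase, from_amp)]
    # One linear layer followed by ReLU plays the role of the MLP.
    fused = [[max(0.0, v) for v in row] for row in matmul(concat, w_fuse)]
    # Residual connection back to the original tokens.
    return [[r + f for r, f in zip(r_row, f_row)]
            for r_row, f_row in zip(r_tokens, fused)]
```

Under the asymmetric injection strategy described in the Figure 3 caption, the output of `fourier_fusion_attention` would correspond to the tokens sent to the text encoder, while the unmodified `r_tokens` would go to the image encoder.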