Reconstruction Target Matters in Masked Image Modeling for Cross-Domain Few-Shot Learning

Ran Ma, Yixiong Zou, Yuhua Li, Ruixuan Li

TL;DR

The paper tackles Cross-Domain Few-Shot Learning (CDFSL) and shows that pixel-based Masked Autoencoder (MAE) pretraining can underperform a supervised baseline when the gap between the source domain $D^S$ and the target domain $D^T$ is large. Analyzing reconstruction targets, it finds that reconstructing pixels pushes MAE toward low-level, domain-specific information. Reconstructing token features mitigates this, but high-level features barely improve transferability, which points to a trade-off between filtering out domain information and preserving the image's global structure. To address this, the authors propose Domain-Agnostic Masked Image Modeling (DAMIM), which pairs an Aggregated Feature Reconstruction (AFR) module, balancing domain-agnostic information against the image's global structure, with a Lightweight Decoder (LD) that reduces reliance on the decoder. Across four CDFSL datasets, DAMIM achieves state-of-the-art performance, supported by ablations, feature-importance analyses, and visualizations that indicate improved cross-domain generalization in few-shot transfer.

Abstract

Cross-Domain Few-Shot Learning (CDFSL) requires the model to transfer knowledge from a data-abundant source domain to data-scarce target domains for fast adaptation, where the large domain gap makes CDFSL a challenging problem. Masked Autoencoder (MAE) excels at using unlabeled data effectively and at learning images' global structure, enhancing model generalization and robustness. However, in the CDFSL task with significant domain shifts, we find that MAE performs even worse than baseline supervised models. In this paper, we first investigate this phenomenon to interpret it. We find that MAE tends to focus on low-level domain information when reconstructing pixels, while changing the reconstruction target to token features can mitigate this problem. However, not all features are beneficial: we then find that reconstructing high-level features hardly improves the model's transferability, indicating a trade-off between filtering domain information and preserving the image's global structure. In short, the reconstruction target matters for the CDFSL task. Based on these findings and interpretations, we further propose Domain-Agnostic Masked Image Modeling (DAMIM) for the CDFSL task. DAMIM includes an Aggregated Feature Reconstruction module that automatically aggregates features for reconstruction, balancing the learning of domain-agnostic information and images' global structure, and a Lightweight Decoder module that further benefits the encoder's generalizability. Experiments on four CDFSL datasets demonstrate that our method achieves state-of-the-art performance.
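
To make the idea of "changing the reconstruction target" concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the target for each token is a softmax-weighted blend of that token's features from several encoder layers, and that the loss is computed only on masked tokens, as in MAE. The names `random_mask`, `aggregate_targets`, `masked_mse`, and `layer_scores` are hypothetical, and the real AFR module may aggregate features differently.

```python
import math
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Choose the token indices hidden from the encoder (MAE-style random masking)."""
    num_masked = int(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_targets(layer_features, layer_scores):
    """Blend per-layer token features into one reconstruction target per token.

    layer_features[l][t] is the feature vector of token t at layer l.
    layer_scores holds one score per layer (assumed learnable in a real model).
    """
    weights = softmax(layer_scores)
    num_tokens = len(layer_features[0])
    dim = len(layer_features[0][0])
    return [
        [
            sum(w * layer_features[l][t][d] for l, w in enumerate(weights))
            for d in range(dim)
        ]
        for t in range(num_tokens)
    ]


def masked_mse(predictions, targets, masked_idx):
    """Mean squared error restricted to the masked tokens."""
    total, count = 0.0, 0
    for t in masked_idx:
        for p, y in zip(predictions[t], targets[t]):
            total += (p - y) ** 2
            count += 1
    return total / max(count, 1)


# Toy data: 3 layers, 8 tokens, 4-dimensional features (illustrative sizes only).
rng = random.Random(0)
num_layers, num_tokens, dim = 3, 8, 4
layer_features = [
    [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
    for _ in range(num_layers)
]
masked = random_mask(num_tokens, 0.75, rng)
targets = aggregate_targets(layer_features, [0.0] * num_layers)
zero_predictions = [[0.0] * dim for _ in range(num_tokens)]
loss = masked_mse(zero_predictions, targets, masked)
```

With equal scores, the target reduces to the plain average of the layers. Raising the score of a shallow or deep layer shifts the target toward low-level or high-level features, which is the axis along which the abstract describes the trade-off.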

Paper Structure (25 sections, 13 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: (a) The ratio of the accuracy of the supervised ViT, MAE, and our method on four CDFSL datasets, showing that MAE underperforms on these datasets. (b) The average performance of the supervised ViT, MAE, iBOT, and our method, which prompts us to examine the role of the reconstruction target in MIM for CDFSL.
  • Figure 2: MAE reconstructs masked patches using an autoencoder, targeting raw pixels for reconstruction.
  • Figure 3: (a) Reconstruction loss measured by using different layer features in ViT as reconstruction targets. The reconstruction loss of shallow-layer features is lower, indicating it is easier for the model to capture and learn low-level features. (b) Domain similarity of the final features between the source and target domains after disrupting features in different layers. Disrupting shallow-layer features leads to a higher domain similarity.
  • Figure 4: Domain similarity of the models using different layer features in ViT as reconstruction targets. Shallow-layer features show lower domain similarity, while reconstructing deeper-layer features can hardly improve the model's transferability, indicating a trade-off between filtering domain information and preserving the image's global structure.
  • Figure 5: Visualization of shallow, middle, and deep features of MAE. In shallow layers, the model predominantly captures low-level information. As the network goes deeper, its focus gradually shifts towards semantic parts of the image.
  • ...and 5 more figures