Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment

Yaze Zhao, Yixiong Zou, Yuhua Li, Ruixuan Li

Abstract

Cross-Domain Few-Shot Learning (CDFSL) adapts models trained on large-scale general data (the source domain) to downstream target domains with only scarce training data; research on vision-language models (e.g., CLIP) in this setting is still in its early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, although they can roughly focus on important regions in source domains. Prior work has demonstrated CLIP's shortcomings in capturing subtle local patterns; in this paper, we find that the domain gap and scarce training data exacerbate these shortcomings, and that they harm local patterns much more than holistic ones. We call this the local misalignment problem in CLIP-based CDFSL. Because no supervision is available for aligning local visual features with text semantics, we turn to self-supervision to address this problem. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then back into visual features (and vice versa), and constrains the original features to be close to the back-translated features. To reduce the noise introduced by the richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mappings. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show that our method (1) effectively improves local vision-language alignment, (2) enhances the interpretability of learned patterns and model decisions through patch visualization, and (3) achieves state-of-the-art performance.
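
The cycle-consistency idea in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: it assumes that "translation" is a softmax-weighted mixture over the other modality's features, that patch and text features already share one embedding space, and that the constraint is a 1 - cosine penalty. The names `translate`, `cycle_loss`, and `temperature`, and the toy vectors, are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def softmax(scores, temperature):
    # Subtract the max score for numerical stability.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def translate(query, bank, temperature=0.1):
    """Assumed translation step: express `query` as a softmax-weighted
    mixture of the unit vectors in `bank`, then re-normalize."""
    weights = softmax([dot(query, b) for b in bank], temperature)
    dim = len(bank[0])
    mixed = [sum(w * b[i] for w, b in zip(weights, bank)) for i in range(dim)]
    return normalize(mixed)

def cycle_loss(queries, bank, temperature=0.1):
    """Mean (1 - cosine) between each query and its round trip
    queries -> bank space -> queries space."""
    total = 0.0
    for q in queries:
        there = translate(q, bank, temperature)        # e.g. patch -> text space
        back = translate(there, queries, temperature)  # text space -> patch space
        total += 1.0 - dot(q, back)
    return total / len(queries)

# Toy unit-norm features (made up for illustration only).
patches = [normalize(v) for v in ([1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0])]
texts = [normalize(v) for v in ([0.9, 0.2, 0.1], [0.1, 0.9, 0.3])]

loss_i_t_i = cycle_loss(patches, texts)  # image -> text -> image
loss_t_i_t = cycle_loss(texts, patches)  # text -> image -> text
print(loss_i_t_i, loss_t_i_t)
```

In this sketch the two directions share one function, with the roles of `queries` and `bank` swapped; a trainable version would backpropagate the summed loss through the feature encoders.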

Paper Structure

This paper contains 24 sections, 17 equations, 19 figures, and 11 tables.

Figures (19)

  • Figure 1: Fine-grained visual cues (marked with red boxes) are crucial in specialized fields such as interpretable medical diagnosis (a). In target domains, however, the fine-tuned CLIP cannot focus on these subtle patterns (b), whereas in source domains (c) CLIP can still roughly capture all regions important for recognition. We therefore hypothesize that the domain gap and scarce training data exacerbate CLIP's shortcomings in capturing subtle patterns much more than its shortcomings in capturing holistic patterns, which is the problem we aim to address.
  • Figure 2: (a) To validate our hypothesis, we measure the alignment score between global/local features and text features. The alignment of both global and local features is harmed under the CDFSL task, but local features show a larger decline in alignment score, which supports our hypothesis. (b) Our proposed method effectively improves local alignment in target domains.
  • Figure 3: Overview of our framework, which consists of three key components: (a) the Text-to-Image-to-Text (T-I-T) cycle-consistency module, (b) the Semantic Anchor (SA) module, and (c) the Image-to-Text-to-Image (I-T-I) cycle-consistency module. The process begins with the SA module augmenting raw images to create a larger corpus, followed by extracting local image features from these images and transforming them via an MLP to align with the text feature space. The T-I-T cycle then uses text features to select semantically relevant patches and maps them back to reconstruct text features, enhancing local feature alignment. Subsequently, the SA module shrinks the feature set to select class-relevant anchor patches, which the I-T-I cycle maps through text features to augmented image features. Our model improves local alignment and interpretability in cross-domain few-shot learning.
  • Figure 4: Base-to-new generalization on 11 datasets.
  • Figure 5: Ablation study on the hybrid coefficient in Eq. \ref{eq:loss}.
  • ...and 14 more figures
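
The global-versus-local comparison described for Figure 2 can be illustrated with a small sketch. The captions do not define the alignment score, so this assumes it is the cosine similarity between an image feature and the text feature of the matching class, averaged over patches for the local case. The function names and toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def global_alignment(global_feature, text_feature):
    # Assumed score: similarity of the whole-image feature to the class text feature.
    return cosine(global_feature, text_feature)

def local_alignment(patch_features, text_feature):
    # Assumed score: mean similarity of each patch feature to the class text feature.
    sims = [cosine(p, text_feature) for p in patch_features]
    return sum(sims) / len(sims)

# Toy features (made up for illustration only).
text_feature = [1.0, 0.0]
global_feature = [0.8, 0.6]
patch_features = [[1.0, 0.0], [0.0, 1.0]]

global_score = global_alignment(global_feature, text_feature)
local_score = local_alignment(patch_features, text_feature)
print(global_score, local_score)
```

Under these assumptions, computing both scores for the same model on source-domain and target-domain images would show how much each one declines under the domain shift.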