Layer-Wise Feature Metric of Semantic-Pixel Matching for Few-Shot Learning

Hao Tang, Junhao Lu, Guoheng Huang, Ming Li, Xuhang Chen, Guo Zhong, Zhengguang Tan, Zinuo Li

TL;DR

The paper proposes the Layer-Wise Feature Metric of Semantic-Pixel Matching (LWFM-SPM), a method that makes finer-grained comparisons between image pairs for few-shot classification. Results indicate that LWFM-SPM achieves competitive performance on four few-shot classification benchmarks.

Abstract

In Few-Shot Learning (FSL), traditional metric-based approaches often rely on global metrics to compute similarity. However, in natural scenes, the spatial arrangement of key instances is often inconsistent across images. This spatial misalignment can result in mismatched semantic pixels, leading to inaccurate similarity measurements. To address this issue, we propose a novel method called the Layer-Wise Feature Metric of Semantic-Pixel Matching (LWFM-SPM) to make finer comparisons. Our method enhances model performance through two key modules: (1) the Layer-Wise Embedding (LWE) Module, which refines the cross-correlation of image pairs to generate well-focused feature maps for each layer; and (2) the Semantic-Pixel Matching (SPM) Module, which aligns critical pixels based on semantic embeddings using an assignment algorithm. We conducted extensive experiments to evaluate our method on four widely used few-shot classification benchmarks: miniImageNet, tieredImageNet, CUB-200-2011, and CIFAR-FS. The results indicate that LWFM-SPM achieves competitive performance across these benchmarks. Our code will be publicly available at https://github.com/Halo2Tang/Code-for-LWFM-SPM.

Paper Structure

This paper contains 17 sections, 15 equations, 4 figures, 9 tables, and 1 algorithm.

Figures (4)

  • Figure 1: The key difference between our method and previous approaches lies in our layer-wise embedding computation and in the way similarity is computed between query and support images. As shown in (a), previous work typically uses CNNs or Transformers to generate single-layer image embeddings; however, a single-layer embedding may not effectively integrate complex semantic information. This embedding is then used to compute a single-layer correlation map, which is applied to reweight the image features. In contrast, as shown in (b), our method integrates multi-layer outputs from the backbone to form multi-level correlation maps. We then compute layer-wise weights for the features at different levels, creating an image embedding that captures diverse semantic focuses across levels while avoiding the complexity of CNNs and Transformers. As shown in (c), prior methods typically calculate the similarity between pixels at the same locations in both images. This overlooks the possibility that semantically similar pixels may be located at different positions, which makes it difficult to assess the true similarity of an image pair. In contrast, as shown in (d), our method employs a matching algorithm that identifies the most similar pixel in the support image for each pixel in the query image, even if their positions do not align. This allows for a more accurate evaluation of the true similarity score.
  • Figure 2: Overview of the proposed LWFM-SPM. The method consists of two stages, following a coarse-to-fine approach. First, Layer-Wise Embedding (LWE) generates multi-level correlation maps, producing well-focused semantic maps from the image pair. Second, Semantic-Pixel Matching (SPM) provides fine-grained metrics for classification by reassigning semantic pixels between the feature maps of the image pair at each layer, using an assignment algorithm.
  • Figure 3: Overview of our matching procedure based on the Hungarian algorithm. It consists of two stages: computing the cost matrix to find the best-matching pixels, and rearranging the pixels. Given the inputs $E_s$ and $E_q$, we first compute a similarity matrix (i.e., the cost matrix) between every pair of pixels. Using the Hungarian algorithm, we identify the one-to-one pixel assignment that maximizes global similarity. Then, $E_q$ is rearranged according to the matching results, so that similar pixels are aligned. This allows us to compute the true similarity between the two images in subsequent steps.
  • Figure 4: The distribution of support and query set embeddings in the 5-way 1-shot task. "·" represents the support set and "X" represents the query set. With the integration of SPM, the query set gets closer to the support set, distinctions between different classes become more pronounced, and misclassifications are reduced.
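
The matching procedure described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `hungarian`, `match_pixels`, `mean_positionwise_similarity`) are hypothetical, cosine similarity is assumed as the pixel-level similarity (the caption only says "similarity matrix"), and pixel embeddings are plain lists of floats rather than backbone feature maps. The Hungarian algorithm is written out in pure Python to keep the sketch self-contained; in practice a library routine such as `scipy.optimize.linear_sum_assignment` would likely be used instead.

```python
import math


def cosine(a, b):
    """Cosine similarity between two pixel embeddings (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list whose entry i is the column assigned to row i.
    Uses the classic row/column-potentials form of the Hungarian algorithm.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to column 0.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_pixels(e_s, e_q):
    """Rearrange query pixels so each lines up with its matched support pixel."""
    sim = [[cosine(s, q) for q in e_q] for s in e_s]
    # Maximizing total similarity equals minimizing the negated similarity.
    cost = [[-x for x in row] for row in sim]
    order = hungarian(cost)
    return order, [e_q[j] for j in order]


def mean_positionwise_similarity(e_s, e_q):
    """Average similarity of pixels that share the same position."""
    return sum(cosine(s, q) for s, q in zip(e_s, e_q)) / len(e_s)


# Toy example: the query holds the same four "pixels" as the support,
# but at shuffled positions (the misalignment case from Figure 1(c)).
e_s = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
e_q = [e_s[2], e_s[0], e_s[3], e_s[1]]

order, e_q_aligned = match_pixels(e_s, e_q)
before = mean_positionwise_similarity(e_s, e_q)
after = mean_positionwise_similarity(e_s, e_q_aligned)
print(f"position-wise: {before:.3f}, after matching: {after:.3f}")
```

In this toy case the position-wise score is low even though both images contain identical content, while the score after matching is close to 1. That gap is the effect the caption attributes to aligning semantically similar pixels before computing similarity.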