Semantic Correspondence: Unified Benchmarking and a Strong Baseline

Kaiyan Zhang, Xinghui Li, Jingyi Lu, Kai Han

TL;DR

This work provides a holistic survey and benchmarking framework for semantic correspondence, classifying methods into handcrafted descriptors, architectural improvements, and training strategies. It demonstrates that backbone quality and fine-tuning are the dominant factors in performance, and shows that a simple baseline combining strong backbones with targeted refinement achieves state-of-the-art results on multiple benchmarks. The study offers extensive controlled experiments across datasets and resolutions, and proposes a unified benchmark to enable fair comparisons. The findings emphasize backbone-driven gains and point to future directions in foundation-model adaptation and more scalable supervisory signals for semantic matching.

Abstract

Establishing semantic correspondence is a challenging task in computer vision, aiming to match keypoints with the same semantic information across different images. Benefiting from the rapid development of deep learning, remarkable progress has been made over the past decade. However, a comprehensive review and analysis of this task is still missing. In this paper, we present the first extensive survey of semantic correspondence methods. We first propose a taxonomy that classifies existing methods by the type of method design. The methods are then categorized accordingly, and we provide a detailed analysis of each approach. Furthermore, we aggregate and summarize the results reported in the literature across various benchmarks into a unified comparative table, with detailed configurations to highlight performance variations. Additionally, to provide a detailed understanding of existing methods for semantic matching, we conduct thorough controlled experiments to analyse the effectiveness of the components of different methods. Finally, we propose a simple yet effective baseline that achieves state-of-the-art performance on multiple benchmarks, providing a solid foundation for future research in this field. We hope this survey serves as a comprehensive reference and consolidated baseline for future development. Code is publicly available at: https://github.com/Visual-AI/Semantic-Correspondence.

Paper Structure

This paper contains 29 sections, 1 equation, 9 figures, 14 tables.

Figures (9)

  • Figure 1: The development timeline of semantic correspondence methods categorized by their levels of supervision. These categories include strongly supervised, weakly supervised, and zero-shot methods. Each method is labeled with its backbone architecture, indicated by distinct symbols: $\bigcirc$ for CNN backbones, ☆ for vision transformer backbones, and • for stable diffusion backbones. For methods that utilize both vision transformer backbones and stable diffusion backbones (e.g., DINOv2+SD), we use a combination of a ☆ and a • to represent their hybrid architecture.
  • Figure 2: Taxonomy of semantic correspondence methods. This taxonomy provides a comprehensive overview of the diverse approaches to enhance feature quality, matching performance, or training strategies. Only a few representative methods of each category are shown.
  • Figure 3: Pipeline for feature enhancement methods. The feature extractor generates feature maps $F^s$ and $F^t$ from the source image $I^s$ and target image $I^t$, respectively. After channel-wise L2 normalization, their dot product constructs a cosine similarity matrix (2D correlation map) for each query point in $F^s$. The correlation map is transformed into a probability distribution through a localization operation such as soft-argmax, which is then supervised by a ground-truth distribution derived from ground-truth correspondences using the same localization technique. During inference, the correlation map localizes the correspondences $(x_1, y_1), (x_2, y_2), \dots$ in $I^t$.
  • Figure 4: Pipeline for cost volume-based methods. The feature extractor generates feature maps $F^s$ and $F^t$ from the source image $I^s$ and target image $I^t$, respectively. After normalization, their dot product constructs a cost volume for each query point in $F^s$, storing the cosine similarity between all possible feature pairs. This cost volume is refined by a cost aggregator and converted into a probability distribution via softmax, supervised by a ground-truth distribution derived from known correspondences. During inference, the correlation map localizes the correspondences $(x_1, y_1), (x_2, y_2), \dots$ in $I^t$.
  • Figure 5: Pipeline for flow field-based methods. Given a pair of images $I^s$ and $I^t$, feature extraction is performed to obtain dense feature maps $F^s$ and $F^t$ respectively. The cost construction step is then applied to derive a cost volume, which is subsequently transformed into a flow field. The flow field represents the correspondence between each pixel in the target image and its corresponding pixel in the source image.
  • ...and 4 more figures
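
The matching step described in the Figure 3 caption (channel-wise L2 normalization, a dot product that yields a cosine-similarity correlation map for a query point, then soft-argmax localization) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `l2_normalize`, `correlation_map`, and `soft_argmax`, the `(C, H, W)` feature layout, and the `temperature` value are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(feat, eps=1e-8):
    """Normalize the channel vector at every spatial location.

    feat: array of shape (C, H, W) (layout is an assumption of this sketch).
    """
    norm = np.linalg.norm(feat, axis=0, keepdims=True)
    return feat / (norm + eps)


def correlation_map(feat_s, feat_t, query_xy):
    """Cosine similarity between one source query point and all target locations.

    query_xy: (x, y) position of the query in the source feature map.
    Returns an (H, W) correlation map over the target feature map.
    """
    fs = l2_normalize(feat_s)
    ft = l2_normalize(feat_t)
    x, y = query_xy
    q = fs[:, y, x]                        # (C,) query descriptor
    return np.einsum("c,chw->hw", q, ft)   # dot product with every target cell


def soft_argmax(corr, temperature=0.05):
    """Turn a correlation map into a distribution and its expected (x, y).

    The temperature is a hypothetical value chosen for illustration.
    """
    h, w = corr.shape
    logits = corr / temperature
    logits = logits - logits.max()         # subtract max for numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum()
    ys, xs = np.mgrid[0:h, 0:w]            # per-cell row and column indices
    x_hat = float((prob * xs).sum())
    y_hat = float((prob * ys).sum())
    return x_hat, y_hat, prob
```

For a query at `(x, y)` in the source feature map, `soft_argmax(correlation_map(feat_s, feat_t, (x, y)))` returns a sub-pixel estimate of the matching location in the target feature map together with the probability map. In training, that probability map is the quantity the caption describes as being supervised by a ground-truth distribution. A lower temperature makes the estimate behave more like a hard argmax.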