Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild

Jiin Im, Sisung Liu, Je Hyeong Hong

Abstract

Semantic correspondence is essential for handling diverse in-the-wild images lacking explicit correspondence annotations. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and its reliance on 2D appearance consequently fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in geometric space, resolving the aforementioned ambiguities. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent but noisy supervisory signal. We therefore introduce a soft-target loss that dynamically blends guidance from this plan with network predictions, yielding a learning framework robust to this noise. SoY achieves state-of-the-art performance on the SPair-71k and AP-10k datasets, establishing a new benchmark in semantic correspondence without explicit geometric annotations. Code is available at Shape-of-You.
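
For orientation, the generic Fused Gromov-Wasserstein objective from the optimal transport literature can be written as below. This is the textbook form, not necessarily the exact formulation used in the paper. All symbols are assumptions introduced here for illustration: $C_{ij}$ is a feature (semantic) cost between source point $i$ and target point $j$, $D^{s}$ and $D^{t}$ are intra-image pairwise distance matrices (here, distances in 3D), $\pi$ is the transport plan with marginals $\mu$ and $\nu$, and $\alpha \in [0, 1]$ is a trade-off weight. Which term $\alpha$ multiplies is a convention that may differ from the paper's.

```latex
\min_{\pi \in \Pi(\mu, \nu)} \;
  (1 - \alpha) \sum_{i,j} C_{ij} \, \pi_{ij}
  \;+\;
  \alpha \sum_{i,j,k,l} \bigl| D^{s}_{ik} - D^{t}_{jl} \bigr|^{2} \, \pi_{ij} \, \pi_{kl}
```

The first term is linear in $\pi$, while the second is quadratic in $\pi$ and sums over all pairs of candidate matches, which is why the abstract calls the problem computationally prohibitive and motivates a linear approximation.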

Paper Structure (54 sections, 11 equations, 13 figures, 10 tables)

Figures (13)

  • Figure 1: Our Fused Gromov-Wasserstein approach combines inter-feature matching with intra-geometric consistency. (Top) Feature matching yields false correspondences (red) when distinct points ($x, x_{bad}$) share similar features to $y$. (Bottom) 3D Gromov-Wasserstein penalizes distortions to filter invalid matches.
  • Figure 1: Pseudo-label analysis with DINOv2 backbone. Performance (PCK$_{\text{label}}$@0.1) on the geometry-aware subset (left) and overall set (right). Using only DINOv2 features, we compare four pseudo-label generation strategies: Nearest Neighbor, Semantic OT, Fused OT, and Fused UOT. Both the geometry-aware and overall subsets show a consistent, monotonic improvement across methods, with clear gains from incorporating semantic OT and our geometry-aware matching. This confirms that our pseudo-labeling remains effective even with a weaker backbone.
  • Figure 2: Overview of our pseudo-label generation pipeline. We first compute an initial semantic match to identify high-confidence anchors. These anchors are then used to create a tractable, linear approximation of the otherwise intractable quadratic Gromov-Wasserstein (GW) geometric cost. This approximated geometric cost is fused with the semantic cost, yielding a final fused cost matrix. This cost matrix is then used to solve an unbalanced optimal transport (UOT) problem, producing a transport plan $\pi^{(T)}$ that serves as our pseudo-label and is robust to geometric ambiguities.
  • Figure 2: Pseudo-label hyperparameter ablation on SPair-71k (PCK$_{\text{label}}$@0.1). We study three key hyperparameters of our pseudo-label generator: (Left) number of anchors $K$, (Middle) feature–geometry trade-off $\alpha$, and (Right) KL regularization strength $\rho$ in UOT. In all cases, the performance is measured as PCK$_{\text{label}}$@0.1 on the SPair-71k test set.
  • Figure 3: Impact of the Gromov-Wasserstein term. Wasserstein baseline (blue) vs. Fused Gromov-Wasserstein (orange). Incorporating geometric structural consistency leads to consistent improvements (+2.3 to +7.0 percentage points) across categories.
  • ...and 8 more figures
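
The pipeline described in the Figure 2 caption (initial semantic match, anchor selection, linearized geometric cost, fused cost, UOT) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: every function name, the cosine cost, the top-confidence anchor rule, the max-normalization of the geometric cost, the role of `alpha` as the geometry weight, and the default values of `k`, `alpha`, `eps`, and `rho` are hypothetical. The UOT solver is the standard entropic scaling iteration with KL-relaxed marginals.

```python
import numpy as np


def cosine_cost(fs, ft):
    """Semantic cost: 1 - cosine similarity between feature rows."""
    fs = fs / np.linalg.norm(fs, axis=1, keepdims=True)
    ft = ft / np.linalg.norm(ft, axis=1, keepdims=True)
    return 1.0 - fs @ ft.T


def unbalanced_sinkhorn(cost, eps=0.05, rho=1.0, n_iters=200):
    """Entropic UOT with KL marginal penalties (standard scaling updates)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform source mass (assumption)
    b = np.full(m, 1.0 / m)  # uniform target mass (assumption)
    kernel = np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(m)
    power = rho / (rho + eps)  # exponent < 1 softens the marginal constraints
    for _ in range(n_iters):
        u = (a / (kernel @ v)) ** power
        v = (b / (kernel.T @ u)) ** power
    return u[:, None] * kernel * v[None, :]


def select_anchors(plan, k):
    """Pick the k source points with the largest peak mass (assumed rule)."""
    best_col = plan.argmax(axis=1)
    rows = np.argsort(-plan.max(axis=1))[:k]
    return rows, best_col[rows]


def anchor_geometric_cost(ps, pt, src_idx, tgt_idx):
    """Linearized GW cost: compare distances to matched anchors only."""
    ds = np.linalg.norm(ps[:, None, :] - ps[src_idx][None, :, :], axis=-1)  # (n, K)
    dt = np.linalg.norm(pt[:, None, :] - pt[tgt_idx][None, :, :], axis=-1)  # (m, K)
    diff = ds[:, None, :] - dt[None, :, :]  # (n, m, K)
    cost = (diff ** 2).mean(axis=-1)
    return cost / (cost.max() + 1e-12)  # rescale to [0, 1] (assumption)


def fused_pseudo_labels(fs, ft, ps, pt, k=8, alpha=0.5, eps=0.05, rho=1.0):
    """Return a soft transport plan to use as a pseudo-label."""
    c_sem = cosine_cost(fs, ft)
    init_plan = unbalanced_sinkhorn(c_sem, eps, rho)          # initial semantic match
    src_idx, tgt_idx = select_anchors(init_plan, k)           # high-confidence anchors
    c_geo = anchor_geometric_cost(ps, pt, src_idx, tgt_idx)   # linear GW surrogate
    c_fused = (1.0 - alpha) * c_sem + alpha * c_geo           # fused cost matrix
    return unbalanced_sinkhorn(c_fused, eps, rho)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs, ft = rng.normal(size=(20, 16)), rng.normal(size=(25, 16))  # toy features
    ps, pt = rng.normal(size=(20, 3)), rng.normal(size=(25, 3))    # toy 3D points
    plan = fused_pseudo_labels(fs, ft, ps, pt)
    print(plan.shape)
```

With anchors fixed, the geometric cost costs O(nmK) to build rather than summing over all pairs of matches, which is the point of the linearization. The three hyperparameters exposed here mirror those ablated in the second Figure 2 caption: the number of anchors $K$, the trade-off $\alpha$, and the KL strength $\rho$.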