Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps

Octave Mariotti, Oisin Mac Aodha, Hakan Bilen

TL;DR

The paper tackles semantic correspondence under challenging conditions like object symmetries and repeated parts by injecting a weak 3D prior via a sphere-based representation. It trains a category-conditioned spherical prototype to align self-supervised features with sphere coordinates, while enforcing geometric priors through viewpoint, relative distance, and orientation losses. A new evaluation metric, Keypoint Average Precision (KAP), better captures symmetry- and repetition-related failures than traditional PCK. Results on SPair-71k and AwA-pose show improved disambiguation of symmetric views and generalization to unseen categories, with efficient inference when combining sphere maps with SSL features. This approach advances robust dense correspondence by marrying self-supervised signals with lightweight 3D priors and a more discerning evaluation framework.

Abstract

Recent progress in self-supervised representation learning has resulted in models that are capable of extracting image features that effectively encode not only image-level but also pixel-level semantics. These features have been shown to be effective for dense visual semantic correspondence estimation, even outperforming fully-supervised methods. Nevertheless, current self-supervised approaches still fail in the presence of challenging image characteristics such as symmetries and repeated parts. To address these limitations, we propose a new approach for semantic correspondence estimation that supplements discriminative self-supervised features with 3D understanding via a weak geometric spherical prior. Compared to more involved 3D pipelines, our model only requires weak viewpoint information, and the simplicity of our spherical representation enables us to inject informative geometric priors into the model during training. We propose a new evaluation metric that better accounts for repeated-part and symmetry-induced mistakes. We present results on the challenging SPair-71k dataset, where we show that our approach is capable of distinguishing between symmetric views and repeated parts across many object categories, and also demonstrate that we can generalize to unseen classes on the AwA dataset.

Paper Structure (24 sections, 7 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Features from self-supervised methods such as DINOv2 (Oquab et al., 2023) have been used to discover parts and regions of objects. However, these features fail to correctly distinguish (i) object symmetries, e.g. the left and right sides of the car have the same features, and (ii) individual parts, e.g. the wheels are represented by the same features irrespective of their location on the car. Our approach addresses these issues through the use of a weak geometric spherical prior. Note that we show the 3D PCA projection of the features for DINOv2 and our learned spherical mapping for our method.
  • Figure 2: Overview of our semantic correspondence estimation approach. We begin by extracting features from a frozen self-supervised backbone, and then use them to predict spherical coordinates via a learned module. Each predicted point is used to query a jointly learned prototype, providing the supervision signal (\ref{sec:corresp_learning}). The sphere is used to enforce weak geometric priors (\ref{sec:geom_priors}). During inference, SSL features are combined with spherical coordinates (\ref{sec:alpha_mix}). A blue outline indicates a fixed module, while an orange outline indicates learned parameters.
  • Figure 3: Illustration of our geometry losses $\mathcal{L}_{rd}$ and $\mathcal{L}_{o}$. The left image shows a spherical map from which a triplet of points is sampled. $\mathcal{L}_{rd}$: as the anchor patch $a$ is closer to the positive $b$ than to the negative $c$ in the image, its corresponding position $s_{a}$ on the sphere must also be closer to $s_{b}$ than to $s_{c}$. $\mathcal{L}_{o}$: after projecting $s_{b}$ and $s_{c}$ onto the plane tangent to the sphere at $s_{a}$, giving $u_{b}$ and $u_{c}$, we ensure orientation is preserved by enforcing positive collinearity between $u_{b} \times u_{c}$ and the normal vector $n$.
  • Figure 4: Qualitative comparison of dense correspondence maps. For DINOv2, SD, and DINO+SD features we perform PCA on the segmented object features independently for each category, then visualize the three main components. Note that the SD and DINO+SD features are not completely equivalent to the ones used to compute matches, but are provided here for illustration. Spherical maps from $f_S$ (Sphere) for our approach are visualized directly. Our spherical maps correctly identify the different sides of objects, whereas other features fail to capture these differences.
  • Figure A1: Architecture details for our sphere mapper and spherical prototype. C denotes the dimension of the SSL embedding, and blocks marked with L are simple linear layers used to change dimensionality.
  • ...and 3 more figures
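
The two geometry losses described in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. The hinge form of both losses, the `margin` parameter, the use of great-circle distance on the sphere, taking the normal $n$ to be $s_a$ itself (the outward normal of a unit sphere), and using the sign of the image-space triplet orientation as the target are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np


def geodesic(s, t):
    """Great-circle distance between two unit vectors on the sphere."""
    return np.arccos(np.clip(np.dot(s, t), -1.0, 1.0))


def relative_distance_loss(p_a, p_b, p_c, s_a, s_b, s_c, margin=0.0):
    """Hinge penalty (assumed form) when the image-space distance ordering
    around the anchor is not reproduced on the sphere.

    p_* are 2D patch positions in the image, s_* are unit 3D sphere points.
    """
    # Make b the image-space positive, i.e. the patch closer to the anchor a.
    if np.linalg.norm(p_a - p_b) > np.linalg.norm(p_a - p_c):
        s_b, s_c = s_c, s_b
    return max(0.0, geodesic(s_a, s_b) - geodesic(s_a, s_c) + margin)


def tangent_project(s, s_a):
    """Project s onto the plane tangent to the unit sphere at s_a."""
    return s - np.dot(s, s_a) * s_a


def orientation_loss(p_a, p_b, p_c, s_a, s_b, s_c):
    """Hinge penalty (assumed form) when the triplet orientation on the
    sphere disagrees with the triplet orientation in the image."""
    d_b, d_c = p_b - p_a, p_c - p_a
    # Sign of the 2D cross product: orientation of (a, b, c) in the image.
    sigma = np.sign(d_b[0] * d_c[1] - d_b[1] * d_c[0])
    u_b = tangent_project(s_b, s_a)
    u_c = tangent_project(s_c, s_a)
    n = s_a  # assumption: outward normal of the unit sphere at s_a
    alignment = np.dot(np.cross(u_b, u_c), n)
    # Zero when u_b x u_c points along sigma * n, positive otherwise.
    return max(0.0, -sigma * alignment)
```

Under these assumptions, a triplet whose sphere positions keep both the distance ordering and the orientation seen in the image incurs zero loss, while mirroring one of the sphere points across the anchor (the kind of left/right flip the captions describe) makes the orientation term positive.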