
SemAlign3D: Semantic Correspondence between RGB-Images through Aligning 3D Object-Class Representations

Krispin Wandel, Hesheng Wang

TL;DR

SemAlign3D tackles robust semantic correspondence by aligning 3D object-class representations with object instances in RGB images. It builds these representations from monocular depth estimates and large vision model features, then minimizes an alignment energy by gradient descent to fit the representation to each instance. On SPair-71k, it raises the overall PCK@0.1 score by 3.3 points, from 85.6% to 88.9%, with gains of more than 10 points on three categories, mostly rigid ones. The work points to a promising direction for 3D-aware, data-efficient semantic alignment and discusses runtime considerations and avenues for future extensions.

Abstract

Semantic correspondence has made tremendous progress thanks to recent advances in large vision models (LVMs). While these LVMs have been shown to reliably capture local semantics, the same cannot yet be said for capturing global geometric relationships between semantic object regions. This limitation leads to unreliable semantic correspondence between images with extreme view variation. In this work, we aim to leverage monocular depth estimates to capture these geometric relationships for more robust and data-efficient semantic correspondence. First, we introduce a simple but effective method to build 3D object-class representations from monocular depth estimates and LVM features using a sparsely annotated image correspondence dataset. Second, we formulate an alignment energy that can be minimized using gradient descent to obtain an alignment between the 3D object-class representation and the object-class instance in the input RGB image. Our method achieves state-of-the-art matching accuracy in multiple categories on the challenging SPair-71k dataset, increasing the PCK@0.1 score by more than 10 points on three categories and overall by 3.3 points, from 85.6% to 88.9%. Additional resources and code are available at https://dub.sh/semalign3d.
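For readers unfamiliar with the metric, the following is a minimal sketch of how a PCK@0.1 score (percentage of correct keypoints) is typically computed for a single image pair. It assumes the common convention of thresholding at 0.1 times the larger side of the object bounding box; the function name `pck` and its arguments are illustrative and are not taken from the authors' code.

```python
import numpy as np


def pck(pred, gt, bbox_hw, alpha=0.1):
    """Fraction of predicted keypoints within alpha * max(bbox side) of the ground truth.

    pred, gt : (N, 2) arrays of pixel coordinates (hypothetical inputs).
    bbox_hw  : (height, width) of the object bounding box in the target image.
    alpha    : relative threshold; 0.1 corresponds to PCK@0.1.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Assumed convention: threshold scales with the larger bounding-box side.
    threshold = alpha * max(bbox_hw)
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists <= threshold))


# Two keypoints, bounding box 100 x 200, so the threshold is 20 pixels.
score = pck([[10, 10], [50, 50]], [[12, 10], [80, 50]], bbox_hw=(100, 200))
```

In this toy pair the first keypoint is off by 2 pixels and counts as correct, while the second is off by 30 pixels and does not, so half of the keypoints are correct. Benchmark scores such as the 85.6% and 88.9% quoted above aggregate this quantity over many image pairs.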

Paper Structure

This paper contains 35 sections, 11 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Our method aligns learned 3D object-class representations, displayed as the colored point cloud, with object instances to improve the robustness of sparse semantic correspondence against extreme view variation and occlusion, which is still a major issue for existing methods like GeoAware [24_left_right].
  • Figure 2: Learned 3D model representations for all SPair-71k [19_spair] categories. Colors represent the principal components of the point cloud features.
  • Figure 3: Construction of Geometric Features. Optimal focal lengths $f_s^*$ can be found by minimizing the variance of the scale-invariant geometric features $A_{ijkl}^{s=1 \dots n}$ and $R_{ijkl}^{s=1 \dots n}$. We use the state-of-the-art model DepthAnythingV2 [24_depth_any] for monocular depth estimation. Although the produced depth map (top right) looks visually appealing, it is still far from perfect, as is evident when it is back-projected to world coordinates. Nevertheless, in this work we demonstrate that these depth maps can still be used to build coherent 3D object-class representations.
  • Figure 4: Construction of Sparse Keypoint Point Cloud in Canonical Form. Built by iteratively computing the next most likely keypoint location based on the fitted Beta-distributions over $A_{ijkl}^{s=1 \dots n}$ and $R_{ijkl}^{s=1 \dots n}$.
  • Figure 5: Overview of our method. After constructing the object-class representation, we minimize $\mathcal{L}_\text{align}$, which is based on similarity maps between the input RGB image and the representation (only four are shown here), to obtain the alignment. The green dots in the final image on the right represent the ground-truth keypoints.
  • ...and 5 more figures
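
The back-projection mentioned in the Figure 3 caption can be sketched with a standard pinhole camera model. This is only an illustration under stated assumptions: the principal point is assumed to lie at the image center, `backproject` and `distance_ratio` are hypothetical helper names, and the distance ratio is merely one example of a quantity that does not change under a global depth rescaling. It is not the definition of the features $A_{ijkl}$ or $R_{ijkl}$, which this page does not give.

```python
import numpy as np


def backproject(depth, f, cx=None, cy=None):
    """Back-project an (H, W) depth map to an (H, W, 3) point map with a pinhole model.

    f      : focal length in pixels (the quantity Figure 3 optimizes per image).
    cx, cy : principal point; assumed to be the image center if not given.
    """
    depth = np.asarray(depth, dtype=float)
    h, w = depth.shape
    cx = (w - 1) / 2 if cx is None else cx
    cy = (h - 1) / 2 if cy is None else cy
    v, u = np.mgrid[0:h, 0:w]  # v: row index, u: column index
    x = (u - cx) * depth / f
    y = (v - cy) * depth / f
    return np.stack([x, y, depth], axis=-1)


def distance_ratio(p_i, p_j, p_k, p_l):
    """Ratio of two 3D distances; unchanged when all depths are scaled by a constant."""
    return np.linalg.norm(p_i - p_j) / np.linalg.norm(p_k - p_l)


depth = np.array([[2.0, 2.5, 3.0],
                  [2.0, 2.5, 3.0],
                  [2.2, 2.6, 3.1]])
pts = backproject(depth, f=2.0)
pts_scaled = backproject(3.0 * depth, f=2.0)  # same scene, depth known only up to scale

r = distance_ratio(pts[0, 0], pts[0, 2], pts[2, 0], pts[1, 1])
r_scaled = distance_ratio(pts_scaled[0, 0], pts_scaled[0, 2],
                          pts_scaled[2, 0], pts_scaled[1, 1])
```

Because every back-projected coordinate is linear in depth, scaling the depth map scales all points by the same factor and leaves such ratios untouched. The focal length, in contrast, changes the point cloud's shape, which is consistent with the caption's idea of choosing $f_s^*$ so that geometric features agree across images.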