SHIC: Shape-Image Correspondences with no Keypoint Supervision

Aleksandar Shtedritski, Christian Rupprecht, Andrea Vedaldi

TL;DR

This work introduces SHIC, a method that learns canonical maps without manual supervision and achieves better results than supervised methods for most categories. It also shows that image generators can make the template views more realistic, which provides an additional source of supervision for the model.

Abstract

Canonical surface mapping generalizes keypoint detection by assigning each pixel of an object to a corresponding point in a 3D template. The concept was popularised by DensePose for the analysis of humans, and authors have since attempted to apply it to more categories, but with limited success due to the high cost of manual supervision. In this work, we introduce SHIC, a method to learn canonical maps without manual supervision which achieves better results than supervised methods for most categories. Our idea is to leverage foundation computer vision models such as DINO and Stable Diffusion that are open-ended and thus possess excellent priors over natural categories. SHIC reduces the problem of estimating image-to-template correspondences to predicting image-to-image correspondences using features from the foundation models. The reduction works by matching images of the object to non-photorealistic renders of the template, which emulates the process of collecting manual annotations for this task. These correspondences are then used to supervise high-quality canonical maps for any object of interest. We also show that image generators can further improve the realism of the template views, which provide an additional source of supervision for the model.
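
The image-to-image matching step that the abstract describes can be illustrated with a minimal sketch. It assumes that dense feature maps for the photo and for a template render are already available; random arrays stand in for the foundation-model features here, and the function names `similarity_heatmap` and `best_match` are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def similarity_heatmap(feat_src, feat_tgt, u):
    """Cosine similarity from source pixel u = (row, col) to every target pixel.

    feat_src, feat_tgt: (H, W, D) dense feature maps. In this sketch they are
    placeholders for features extracted by a foundation model.
    Returns an array with the spatial shape of feat_tgt.
    """
    f = feat_src[u[0], u[1]]
    f = f / (np.linalg.norm(f) + 1e-8)  # unit-normalise the query feature
    t = feat_tgt / (np.linalg.norm(feat_tgt, axis=-1, keepdims=True) + 1e-8)
    return t @ f  # dot product of unit vectors = cosine similarity

def best_match(heatmap):
    """Return the (row, col) of the most similar target pixel."""
    return np.unravel_index(np.argmax(heatmap), heatmap.shape)

# Toy data: random features with one planted correspondence (assumption).
rng = np.random.default_rng(0)
feat_src = rng.normal(size=(8, 8, 16))
feat_tgt = rng.normal(size=(8, 8, 16))
feat_tgt[5, 2] = feat_src[3, 4]  # target pixel (5, 2) matches source pixel (3, 4)

heat = similarity_heatmap(feat_src, feat_tgt, (3, 4))
match = best_match(heat)
```

With real features, the target would be a render of the 3D template, so the matched pixel can be traced back to a point on the template surface.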

Paper Structure (37 sections, 6 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: Unsupervised canonical maps. We show predictions from our fully unsupervised method SHIC, which finds correspondences between a rigid 3D template and a natural image. Correspondences are color-coded by assigning a distinct color to each template surface point. Our approach is highly data-efficient; the elephant, T-Rex, and Appa models above are trained on only 2800, 480, and 180 images, respectively.
  • Figure 2: Image-to-template correspondences using 2D renderings. Using an unsupervised semantic correspondence method, we can find correspondences between an image of an object and a rendering of its 3D template. Here we show the similarity heatmap from the source location (annotated in red) to all pixel locations in the target image using SD-DINO \cite{zhang2023sd-dino}.
  • Figure 3: Zero-shot image-to-template correspondences. From left to right: an image $I$ with a selected pixel $u$; several views $J_i$ of the synthetic template; corresponding renderings and similarities $S_{IJ_i}(u,v)$ as functions of the target locations $v \in \Omega$; the final similarities $\Sigma_I(u)$ visualized as a heatmap on top of the canonical surface $M$. The maximizer of the latter (red dot) identifies the vertex $x_k$ that best corresponds to the selected pixel $u$ in the source image $I$ (i.e., base of the left ear of the cat).
  • Figure 4: CSE dense pose predictor. We jointly train a deep network $\Phi$ and a matrix $C$ that transforms Laplace-Beltrami operator (LBO) eigenvectors to a shared $D$-dimensional space. For supervision, we use pseudo-ground truth obtained as described in \ref{sec:image-to-3d-matching}. The image encoder is a frozen pre-trained DINO ViT, and the decoder we learn is a CNN.
  • Figure 5: Realistic rendering of the template. We create synthetic data for pixel-vertex correspondences by generating photorealistic images from depth renders. We obtain the corresponding vertices by projecting the template vertices onto the image.
  • ...and 10 more figures
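
The aggregation described in the Figure 3 caption, from per-view similarities $S_{IJ_i}(u,v)$ to per-vertex scores $\Sigma_I(u)$, can be sketched as follows. This is only an illustration under stated assumptions: the scores are averaged over the views in which a vertex is visible, which may differ from the paper's exact rule, and the function name `aggregate_over_views` and all toy inputs are hypothetical.

```python
import numpy as np

def aggregate_over_views(sims, proj, visible):
    """Pool per-view similarity maps into one score per template vertex.

    sims:    (N, H, W) similarity maps for one fixed source pixel u, one per view
    proj:    (N, K, 2) integer (row, col) where vertex k projects in view i
    visible: (N, K) bool, True if vertex k is visible in view i
    Returns a (K,) array; vertices seen in no view get -inf.
    Assumption: scores are averaged over visible views.
    """
    n_views = sims.shape[0]
    view_idx = np.arange(n_views)[:, None]                 # (N, 1), broadcasts to (N, K)
    per_view = sims[view_idx, proj[..., 0], proj[..., 1]]  # (N, K) sampled similarities
    per_view = np.where(visible, per_view, 0.0)            # ignore occluded vertices
    counts = visible.sum(axis=0)
    scores = per_view.sum(axis=0) / np.maximum(counts, 1)
    scores[counts == 0] = -np.inf
    return scores

# Toy example: 2 views of size 4x4 and 3 vertices (all values are made up).
sims = np.zeros((2, 4, 4))
sims[0, 0, 0] = 0.2
sims[0, 1, 1] = 0.9
sims[1, 2, 3] = 0.7
proj = np.array([
    [[0, 0], [1, 1], [3, 3]],  # view 0: pixel location of vertices 0, 1, 2
    [[0, 0], [2, 3], [3, 3]],  # view 1
])
visible = np.array([
    [True, True, False],
    [False, True, False],
])

scores = aggregate_over_views(sims, proj, visible)
k_best = int(np.argmax(scores))  # index of the best-matching vertex x_k
```

The maximizer `k_best` plays the role of the red dot in Figure 3: the vertex $x_k$ that best corresponds to the selected pixel $u$.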