Semi-supervised Dense Keypoints Using Unlabeled Multiview Images

Zhixuan Yu, Haozheng Yu, Long Sha, Sujoy Ganguly, Hyun Soo Park

TL;DR

A new end-to-end semi-supervised framework for learning a dense keypoint detector from unlabeled multiview images. It outperforms existing methods, including non-differentiable bootstrapping, in keypoint accuracy, multiview consistency, and 3D reconstruction accuracy.

Abstract

This paper presents a new end-to-end semi-supervised framework to learn a dense keypoint detector using unlabeled multiview images. A key challenge lies in finding the exact correspondences between the dense keypoints in multiple views, since the inverse of the keypoint mapping can be neither analytically derived nor differentiated. This prevents the direct use of existing multiview supervision approaches for sparse keypoints, which rely on exact correspondences. To address this challenge, we derive a new probabilistic epipolar constraint that encodes two desired properties. (1) Soft correspondence: we define a matchability, which measures the likelihood that a point matches its corresponding point in the other image, thus relaxing the requirement of exact correspondences. (2) Geometric consistency: every point in the continuous correspondence fields must collectively satisfy multiview consistency. We formulate the probabilistic epipolar constraint as a matchability-weighted average of epipolar errors, thereby generalizing the point-to-point geometric error to a field-to-field geometric error. This generalization facilitates learning a geometrically coherent dense keypoint detection model from a large number of unlabeled multiview images. Additionally, to prevent degenerate cases, we employ a distillation-based regularization using a pretrained model. Finally, we design a new neural network architecture, made of twin networks, that effectively minimizes the probabilistic epipolar errors of all possible correspondences between two views by building affinity matrices. Our method outperforms existing methods, including non-differentiable bootstrapping, in keypoint accuracy, multiview consistency, and 3D reconstruction accuracy.
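The matchability-weighted average of epipolar errors described in the abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel used for the matchability, the bandwidth `sigma`, the symmetric point-to-line distance, the normalization by the sum of weights, and all function names are hypothetical choices made here. A real training setup would use an automatic differentiation framework rather than plain NumPy.

```python
import numpy as np


def epipolar_error(x, x_prime, F, eps=1e-8):
    """Symmetric point-to-epipolar-line distance for every pair of pixels.

    x: (N, 2) pixels in view 1, x_prime: (K, 2) pixels in view 2,
    F: (3, 3) fundamental matrix with the convention x'^T F x = 0.
    Returns an (N, K) matrix of distances in pixels.
    """
    xh = np.hstack([x, np.ones((len(x), 1))])                # (N, 3) homogeneous
    xph = np.hstack([x_prime, np.ones((len(x_prime), 1))])   # (K, 3) homogeneous
    lines_in_2 = xh @ F.T        # row i is F x_i, the epipolar line of x_i in view 2
    lines_in_1 = xph @ F         # row j is F^T x'_j, the epipolar line of x'_j in view 1
    algebraic = np.abs(lines_in_2 @ xph.T)                   # |x'_j^T F x_i|
    d2 = algebraic / (np.linalg.norm(lines_in_2[:, :2], axis=1)[:, None] + eps)
    d1 = algebraic / (np.linalg.norm(lines_in_1[:, :2], axis=1)[None, :] + eps)
    return 0.5 * (d1 + d2)


def matchability(u, u_prime, sigma=0.1):
    """Soft correspondence weights from predicted surface coordinates.

    ASSUMPTION: a Gaussian kernel on the distance between the surface
    coordinates u (N, C) and u_prime (K, C); the paper may define it differently.
    """
    diff = u[:, None, :] - u_prime[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))


def probabilistic_epipolar_loss(x, u, x_prime, u_prime, F, sigma=0.1, eps=1e-8):
    """Weighted average of epipolar errors, with matchability as the weight."""
    M = matchability(u, u_prime, sigma)      # (N, K) soft correspondences
    E = epipolar_error(x, x_prime, F)        # (N, K) geometric errors
    return np.sum(M * E) / (np.sum(M) + eps)


# Toy rectified stereo pair: epipolar lines are horizontal, so the
# epipolar error of a pair reduces to the difference in row coordinates.
F_rectified = np.array([[0.0, 0.0, 0.0],
                        [0.0, 0.0, -1.0],
                        [0.0, 1.0, 0.0]])
x1 = np.array([[10.0, 5.0], [20.0, 8.0]])
x2 = np.array([[13.0, 5.0], [25.0, 8.0]])
u1 = np.array([[0.2, 0.3], [0.7, 0.6]])

loss_consistent = probabilistic_epipolar_loss(x1, u1, x2, u1, F_rectified)
loss_swapped = probabilistic_epipolar_loss(x1, u1, x2, u1[::-1], F_rectified)
```

In this toy setup, predictions that assign the same surface coordinate to pixels on the same epipolar line give a loss close to zero, while swapping the predicted coordinates in the second view concentrates the matchability on geometrically inconsistent pairs and raises the loss. No exact pixel-to-pixel correspondence is ever computed, which is the point of the soft formulation.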

Paper Structure

This paper contains 20 sections, 17 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: We use unlabeled multiview images to learn a dense keypoint model via epipolar geometry in an end-to-end fashion. As a byproduct, we can reconstruct the 3D body surface by triangulating visible regions of body parts.
  • Figure 2: A dense keypoint field maps a point in an image to the canonical body surface coordinate, i.e., $\mathbf{u} = \phi(\mathbf{x};\mathcal{I})$. Establishing a correspondence between two views requires the analytic inverse of $\phi$, which does not exist in general. We present a matchability $M(\mathbf{x}, \mathbf{x}'; \phi, \mathcal{I}, \mathcal{I}')$, the likelihood of a match through the body surface coordinate. We combine the matchability with the epipolar error $d(\mathbf{x}, \mathbf{x}'; \mathbf{F})$ to obtain a probabilistic epipolar error $\mathds{E}(\mathbf{x}, \mathbf{x}')$.
  • Figure 3: Our multiview supervision progressively minimizes the epipolar error between two views (top and bottom) as the dense keypoint detection model is learned. The keypoints detected independently in each view by a pretrained model (Iter 0) are not geometrically consistent. As the optimization progresses, the error is significantly reduced, resulting in a geometrically coherent model.
  • Figure 4: We design a new architecture composed of twin networks that detect dense keypoint fields. The dense keypoint fields from two views are combined to form two affinity matrices: matchability $\mathbf{M}$ and epipolar error $\mathbf{E}$. $\mathbf{M}$ is obtained from the dense keypoint fields ($\mathbf{u}$ and $\mathbf{u}'$), and $\mathbf{E}$ is obtained from the epipolar error of pixel coordinates ($\mathbf{x}$ and $\mathbf{x}'$). These matrices allow us to compute dense epipolar errors and the subsequent multiview geometric consistency loss $\mathcal{L}_{\rm M}$. The same operations are applied to compute $\mathcal{L}_{\rm T}$. In addition, we use distillation-based regularization with a pretrained model $\phi_0$ to avoid degenerate cases ($\mathcal{L}_{\rm R}$). We measure the labeled loss $\mathcal{L}_{\rm L}$ when the ground truth dense keypoint field is available. $\odot$ denotes element-wise multiplication of matrices. $\ominus$ denotes subtraction between dense keypoint predictions.
  • Figure 5: Qualitative results on the Human3.6M, Ski-Pose PTZ-Camera, and OpenMonkeyPose datasets. Heatmaps overlaid on the images indicate the epipolar error at each pixel.
  • ...and 2 more figures
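
Using the notation from the captions of Figures 2 and 4, the relationship between the two affinity matrices and the multiview loss can be written out as a rough sketch. The pixel indices $i$ and $j$, and the choice to leave the normalization unspecified, are assumptions made here for illustration; the paper's equations may differ in form.

$$
\mathbf{M}_{ij} = M(\mathbf{x}_i, \mathbf{x}'_j; \phi, \mathcal{I}, \mathcal{I}'), \qquad
\mathbf{E}_{ij} = d(\mathbf{x}_i, \mathbf{x}'_j; \mathbf{F}), \qquad
\mathcal{L}_{\rm M} \propto \sum_{i,j} \left(\mathbf{M} \odot \mathbf{E}\right)_{ij}
$$

Read this way, each entry of $\mathbf{M} \odot \mathbf{E}$ is the epipolar error of one candidate pixel pair, weighted by how likely that pair is to be a true match. Pairs with low matchability contribute little, so the loss does not require exact correspondences.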