Cycle-Correspondence Loss: Learning Dense View-Invariant Visual Features from Unlabeled and Unordered RGB Images

David B. Adrian, Andras Gabor Kupcsik, Markus Spies, Heiko Neumann

TL;DR

This work addresses learning dense, view-invariant visual descriptors for robotic manipulation without ground-truth correspondence labels. It introduces Cycle-Correspondence Loss (CCL), a self-supervised, cycle-consistency-based objective that can train from unordered RGB images by predicting pixel correspondences and validating them through a reverse prediction, with uncertainty-based weighting to handle non-matchable points. Empirically, CCL outperforms other RGB-only self-supervised methods and approaches fully supervised baselines on keypoint tracking and a 6D grasping task, while greatly simplifying data collection (unordered RGB shots). The approach thus offers a practical, data-efficient pathway to robust dense descriptors useful for manipulation tasks, with potential extensions to transformer-based architectures and self-attention for cost estimation in matching.

Abstract

Robot manipulation relying on learned object-centric descriptors has become popular in recent years. Visual descriptors can easily describe manipulation task objectives, they can be learned efficiently using self-supervision, and they can encode actuated and even non-rigid objects. However, learning robust, view-invariant keypoints in a self-supervised manner requires meticulous data collection involving precise calibration and expert supervision. In this paper we introduce Cycle-Correspondence Loss (CCL) for view-invariant dense descriptor learning, which adopts the concept of cycle-consistency, enabling a simple data collection pipeline and training on unpaired RGB camera views. The key idea is to autonomously detect valid pixel correspondences by attempting to use a prediction over a new image to predict the original pixel in the original image, while scaling error terms based on the estimated confidence. Our evaluation shows that we outperform other self-supervised RGB-only methods and approach the performance of supervised methods, both in keypoint tracking and in a downstream robot grasping task.

Paper Structure

This paper contains 20 sections, 9 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Overview of the cycle-correspondence loss. $\bm{I}_{A}$ and $\bm{I}_{\hat{A}}$ denote versions of the same image, both related through a random image transformation $\thicksim T$. $\bm{I}_{B}$ is a randomly sampled image that exhibits partial content overlap with $\bm{I}_{A}$. We establish a correspondence cycle by randomly sampling location $\bm{k}_{A}$ on $\bm{I}_{A}$, computing a matching distribution $p_B$ over $\bm{I}_{B}$ which we utilize to predict $\bm{k}_{\hat{A}}$ on $\bm{I}_{\hat{A}}$. As location $\bm{k}_{\hat{A}}$ is known through the augmentation, we can optimize the prediction error $l$ to improve the model. We utilize the predicted distributions to scale individual error terms $l$ by the associated uncertainty, effectively dealing with sampled $\bm{k}_{A}$ that have no valid correspondence in $\bm{I}_{B}$.
  • Figure 2: Visualization of the matching uncertainty. The red circle in the leftmost image marks the sampled keypoint. The following test images are superimposed with the predicted distribution as a heatmap. If a correspondence exists (second from left), the mass of the distribution is well localized. If no correspondence exists (middle right and rightmost image), the mass is spread over the various areas that are most similar in descriptor space. Best viewed in color.
  • Figure 3: Example of a hand-annotated, cross-scene keypoint-matching test image pair. The pair exhibits occlusion, background changes, and strong viewpoint or object-pose changes.
  • Figure 4: Evaluation of prediction accuracy for different quantiles $q$ and variance scalings.
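
The cycle described in the Figure 1 caption can be illustrated with a small, dependency-free sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: matching distributions are taken as a softmax over dot-product similarities with a hypothetical `temperature`, the match in $\bm{I}_{B}$ is carried forward as a probability-weighted ("soft") descriptor, uncertainty is measured as the spatial variance of $p_B$, and the error term is scaled by `1 / (1 + variance)`. The function names (`match_distribution`, `cycle_term`, and so on) are likewise hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def match_distribution(query, descriptors, temperature=0.1):
    """Distribution over pixels from descriptor similarity (assumed form)."""
    return softmax([dot(query, d) for d in descriptors], temperature)

def expected(values, probs):
    """Probability-weighted mean of a list of equal-length vectors."""
    dim = len(values[0])
    return [sum(p * v[i] for p, v in zip(probs, values)) for i in range(dim)]

def spatial_variance(coords, probs):
    """Spread of a matching distribution around its mean location."""
    mean = expected(coords, probs)
    return sum(
        p * sum((c[i] - mean[i]) ** 2 for i in range(len(mean)))
        for p, c in zip(probs, coords)
    )

def cycle_term(desc_a, k_a_hat, desc_B, coords_B, desc_A_hat, coords_A_hat,
               temperature=0.1):
    """One cycle: k_A -> p_B over I_B -> predicted k_A_hat on I_A_hat.

    Returns (weighted_error, raw_error, variance_of_p_B).
    """
    # Forward step: match the sampled descriptor from I_A into I_B.
    p_B = match_distribution(desc_a, desc_B, temperature)
    # Assumption: carry the match forward as a soft descriptor.
    soft_desc_B = expected(desc_B, p_B)
    # Backward step: match into the augmented image I_A_hat.
    p_A_hat = match_distribution(soft_desc_B, desc_A_hat, temperature)
    k_pred = expected(coords_A_hat, p_A_hat)
    # k_a_hat is known from the augmentation, so the error needs no labels.
    error = math.dist(k_pred, k_a_hat)
    # Assumption: a spread-out p_B signals "no valid correspondence".
    variance = spatial_variance(coords_B, p_B)
    weight = 1.0 / (1.0 + variance)
    return weight * error, error, variance

# Toy 2x2 "images" whose four pixels carry one-hot descriptors.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
one_hot = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# A keypoint with a clear match, and one equally similar to every pixel.
matched = cycle_term(one_hot[2], (0.0, 1.0), one_hot, coords, one_hot, coords)
ambiguous = cycle_term([0.5] * 4, (0.0, 1.0), one_hot, coords, one_hot, coords)
```

In this toy setup the matched keypoint yields a sharply peaked $p_B$ with low variance, while the ambiguous one spreads its mass evenly over $\bm{I}_{B}$, so its error term is down-weighted. This mirrors the behavior that Figure 2 visualizes, under the assumptions stated above.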