Cycle-Correspondence Loss: Learning Dense View-Invariant Visual Features from Unlabeled and Unordered RGB Images
David B. Adrian, Andras Gabor Kupcsik, Markus Spies, Heiko Neumann
TL;DR
This work addresses learning dense, view-invariant visual descriptors for robotic manipulation without ground-truth correspondence labels. It introduces Cycle-Correspondence Loss (CCL), a self-supervised objective based on cycle-consistency that trains on unordered RGB images: it predicts a pixel correspondence in a second image, validates it by predicting back to the original pixel, and weights the error by estimated uncertainty to handle non-matchable points. Empirically, CCL outperforms other RGB-only self-supervised methods and approaches the performance of fully supervised baselines on keypoint tracking and a 6D grasping task, while greatly simplifying data collection, since unordered RGB images suffice. The approach thus offers a practical, data-efficient pathway to robust dense descriptors for manipulation tasks, with potential extensions to transformer-based architectures and self-attention for cost estimation in matching.
Abstract
Robot manipulation relying on learned object-centric descriptors has become popular in recent years. Visual descriptors can easily describe manipulation task objectives, can be learned efficiently using self-supervision, and can encode actuated and even non-rigid objects. However, learning robust, view-invariant keypoints in a self-supervised manner requires a meticulous data collection procedure involving precise calibration and expert supervision. In this paper we introduce Cycle-Correspondence Loss (CCL) for view-invariant dense descriptor learning, which adopts the concept of cycle-consistency, enabling a simple data collection pipeline and training on unpaired RGB camera views. The key idea is to autonomously detect valid pixel correspondences by using the prediction made in a new image to predict the original pixel back in the original image, while scaling error terms based on the estimated confidence. Our evaluation shows that we outperform other self-supervised RGB-only methods and approach the performance of supervised methods, both in keypoint tracking and in a downstream robot grasping task.
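
The cycle idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a soft-argmax (spatial expectation) matcher over dot-product similarities, a softmax temperature, and a simple uncertainty weighting based on the spatial spread of the match distribution. The function names `soft_match` and `cycle_loss`, the `temperature` parameter, and the weighting formula are all hypothetical choices made for illustration.

```python
import numpy as np

def soft_match(query, desc_map, temperature=0.05):
    """Softly match one descriptor against a dense descriptor map.

    Returns the expected (row, col) location, the spatial standard
    deviation of the match distribution, and the probability-weighted
    descriptor at the predicted location.
    """
    h, w, c = desc_map.shape
    flat = desc_map.reshape(-1, c)                 # (H*W, C)
    logits = flat @ query / temperature            # dot-product similarity
    logits -= logits.max()                         # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()                           # softmax over all pixels
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    mean = probs @ coords                          # spatial expectation
    var = probs @ ((coords - mean) ** 2).sum(axis=1)
    blended = probs @ flat                         # descriptor at the soft match
    return mean, np.sqrt(var), blended

def cycle_loss(desc_a, desc_b, pixel, temperature=0.05):
    """Forward match A -> B, backward match B -> A, then score the cycle.

    Assumed weighting (hypothetical): the cycle error is divided by
    (1 + sigma) and log(1 + sigma) is added, so uncertain matches are
    down-weighted but inflating the uncertainty is not free.
    """
    y, x = pixel
    _, sigma_fwd, desc_hat = soft_match(desc_a[y, x], desc_b, temperature)
    p_back, sigma_bwd, _ = soft_match(desc_hat, desc_a, temperature)
    error = np.linalg.norm(p_back - np.array([y, x], dtype=float))
    sigma = sigma_fwd + sigma_bwd
    loss = error / (1.0 + sigma) + np.log1p(sigma)
    return loss, error
```

In this sketch, a pixel whose descriptor has a sharp, unique match in the second image returns close to its starting location and yields a small cycle error. A pixel without a counterpart produces a flat match distribution, so its large spatial spread reduces the weight of its error term.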
