Unsupervised Deep Graph Matching Based on Cycle Consistency

Siddharth Tourani, Carsten Rother, Muhammad Haris Khan, Bogdan Savchynskyy

TL;DR

The paper tackles unsupervised deep graph matching for keypoint correspondence by introducing a discrete cycle-consistency loss as the supervision signal. It uses black-box differentiation to backpropagate through combinatorial solvers for the linear assignment problem (LAP) and the quadratic assignment problem (QAP), enabling end-to-end training with arbitrary neural architectures. A flexible network architecture combines a VGG16 backbone, SplineCNN refinements, and self- and cross-attention to produce unary and edge costs for matching, achieving state-of-the-art performance on standard benchmarks without ground-truth matches. Empirical results on Pascal VOC, Willow, and SPair-71K demonstrate robust unsupervised performance and the importance of attention mechanisms, with ablations highlighting the value of cycle consistency and the solver-agnostic design.

Abstract

We contribute to the sparsely populated area of unsupervised deep graph matching with application to keypoint matching in images. Contrary to the standard supervised approach, our method does not require ground truth correspondences between keypoint pairs. Instead, it is self-supervised by enforcing consistency of matchings between images of the same object category. As the matching and the consistency loss are discrete, their derivatives cannot be straightforwardly used for learning. We address this issue in a principled way by building our method upon the recent results on black-box differentiation of combinatorial solvers. This makes our method exceptionally flexible, as it is compatible with arbitrary network architectures and combinatorial solvers. Our experimental evaluation suggests that our technique sets a new state-of-the-art for unsupervised graph matching.
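
To make the black-box differentiation step more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the commonly described scheme for differentiating through a cost-minimizing solver: perturb the costs by the incoming gradient scaled by a hyperparameter, solve again, and use the scaled difference of the two solutions as the gradient with respect to the costs. The brute-force `solve_lap`, the helper `blackbox_backward`, and the value `lam=10.0` are all hypothetical choices made for illustration.

```python
from itertools import permutations


def solve_lap(costs):
    """Brute-force LAP solver (illustration only): return the 0/1
    assignment matrix that minimizes the total cost."""
    n = len(costs)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(costs[i][p[i]] for i in range(n)),
    )
    return [[1 if best[i] == j else 0 for j in range(n)] for i in range(n)]


def blackbox_backward(costs, x, grad_x, lam=10.0):
    """Surrogate gradient of the loss w.r.t. the costs.

    costs  : cost matrix used in the forward pass
    x      : forward solution, x = solve_lap(costs)
    grad_x : gradient of the loss w.r.t. the discrete solution x
    lam    : interpolation hyperparameter (assumed value)
    """
    n = len(costs)
    # Perturb the costs in the direction that makes the loss larger.
    perturbed = [
        [costs[i][j] + lam * grad_x[i][j] for j in range(n)] for i in range(n)
    ]
    x_lam = solve_lap(perturbed)
    # Scaled difference between the original and the perturbed solution.
    return [[-(x[i][j] - x_lam[i][j]) / lam for j in range(n)] for i in range(n)]


costs = [[1.0, 2.0], [2.0, 1.0]]
x = solve_lap(costs)  # the identity assignment is cheapest here
# Hypothetical incoming gradient: the loss grows with the diagonal entries of x.
grad_x = [[1.0, 0.0], [0.0, 1.0]]
grad_costs = blackbox_backward(costs, x, grad_x)
```

In this toy case a gradient-descent step on the costs raises the diagonal costs and lowers the off-diagonal ones, which makes the penalized assignment less attractive to the solver. Because the solver is only called as a function, any LAP or QAP solver could be substituted for `solve_lap`.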

Paper Structure (27 sections, 10 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Illustration of cycle consistency in multi-graph matching (best viewed in color). There are three nodes in each image. They are labelled by both color (blue, green, purple) and number (1, 2, 3). Matches between pairs of nodes are shown by colored lines. $A \leftrightarrow B$, $B \leftrightarrow C$ and $C \leftrightarrow A$ are color coded with yellow, light purple and light pink lines. Correct matches are shown by solid lines, wrong matches by dotted lines. The matching of node 2 is cycle consistent across the images, whereas the matchings of nodes 1 and 3 are not.
  • Figure 2: Overview of our framework for a batch of 3 images. Features extracted from images and keypoint positions are transformed into matching costs for each pair of images. The QAP$_{ij}$ blocks compute the matching either as LAP or QAP. At the end, the cycle loss counts the number of inconsistent cycles and computes a gradient for backpropagation.
  • Figure 3: (a) Partial loss illustration for a triple of indices $i,s,k$ and the respective binary variables $x^{12},x^{23},x^{31}$. The solid lines for $x^{23}$ and $x^{31}$ denote that these variables are equal to 1 and correspond to an actual matching between the respective points. The dashed line for $x^{12}$ denotes that this variable is equal to 0 and therefore the points indexed by $i$ and $s$ are not matched to each other. Given the values of $x^{23}$ and $x^{31}$, this violates cycle consistency. (b-e) Illustration of the values of the derivative $\partial{\ell}/\partial{x^{12}}$. The meaning of the solid and dashed lines as well as the positions of $x^{12}$, $x^{23}$ and $x^{31}$ are the same as in (a). The thick blue dotted lines mean that $x^{12}$ can be either 0 or 1, since $\partial{\ell}/\partial{x^{12}}$ is independent of $x^{12}$ (see the loss-gradient equation in the paper). For instance, $\partial{\ell}/\partial{x^{12}} = 1$ for $x^{23}=0$ and $x^{31}=1$, as illustrated by (c).
  • Figure 4: Information flow for feature processing and matching instance construction. The feature extraction layer is shown in the blue box. The inputs to the pipeline are image-keypoint pairs, $(\texttt{Im}^1,\texttt{KP}^1)$ and $(\texttt{Im}^2,\texttt{KP}^2)$ in the figure. The features extracted via a pre-trained VGG16 backbone network are refined by SplineCNN layers. The outputs of the SplineCNN layers are subsequently passed through self-attention (SA) layers with relative position encoding (RPE) and cross-attention (CA) layers, and are finally used in the construction of a matching instance. NC and EC denote node and edge costs. See the detailed description in the main text.
  • Figure 5: Visualization of matching results on the SPair-71K dataset. In addition to the unsupervised techniques GANN, CL-BBGM and CLUM, we show results of the fully supervised BBGM as a baseline. Correctly matched keypoints are shown as green dots, whereas incorrect matches are represented by red lines. The matched keypoints generally have a similar appearance, which suggests sensible unary costs. The improvement in matching quality from top to bottom is arguably mainly due to improved pairwise costs, with the fully supervised BBGM method showing the best results.
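
The cycle loss described in the captions of Figures 2 and 3 can be sketched in a few lines. The sketch below is an assumption reconstructed from the captions, not the paper's code: it takes the partial loss to be 1 exactly when two of the three binary variables are 1 (an open cycle), written as a polynomial so that its derivative with respect to $x^{12}$ does not depend on $x^{12}$. The function names and the example matchings are hypothetical.

```python
def partial_loss(x12, x23, x31):
    """Assumed partial loss for one index triple (i, s, k): equals 1 when
    exactly two of the three binary variables are 1, and 0 otherwise."""
    return (
        x12 * x23 * (1 - x31)
        + x12 * (1 - x23) * x31
        + (1 - x12) * x23 * x31
    )


def d_loss_d_x12(x23, x31):
    """Derivative of the polynomial above w.r.t. x12; it does not depend
    on x12 itself."""
    return x23 + x31 - 3 * x23 * x31


def cycle_loss(m12, m23, m31):
    """Sum the partial loss over all triples (i, s, k), where m12[i][s],
    m23[s][k] and m31[k][i] are entries of 0/1 matching matrices."""
    n = len(m12)
    return sum(
        partial_loss(m12[i][s], m23[s][k], m31[k][i])
        for i in range(n)
        for s in range(n)
        for k in range(n)
    )


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Hypothetical wrong matching that swaps the first and the last node.
swapped = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]

consistent = cycle_loss(identity, identity, identity)
inconsistent = cycle_loss(identity, identity, swapped)
```

With three identity matchings every cycle closes and the loss is zero. Replacing one matching with the swapped one leaves the middle node consistent and opens cycles through the other two nodes, which mirrors the situation drawn in Figure 1.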