Zero-Shot Image Feature Consensus with Deep Functional Maps

Xinle Cheng, Congyue Deng, Adam Harley, Yixin Zhu, Leonidas Guibas

TL;DR

This work argues that a better correspondence strategy is available, one that directly imposes structure on the correspondence field: the functional map. It lifts the correspondence problem from pixel space to function space and directly optimizes for mappings that are globally coherent.

Abstract

Correspondences emerge from large-scale vision models trained for generative and discriminative tasks. This has been revealed and benchmarked by computing correspondence maps between pairs of images, using nearest neighbors on the feature grids. Existing work has attempted to improve the quality of these correspondence maps by carefully mixing features from different sources, such as by combining the features of different layers or networks. We point out that a better correspondence strategy is available, which directly imposes structure on the correspondence field: the functional map. Wielding this simple mathematical tool, we lift the correspondence problem from the pixel space to the function space and directly optimize for mappings that are globally coherent. We demonstrate that our technique yields correspondences that are not only smoother but also more accurate, with the possibility of better reflecting the knowledge embedded in the large-scale vision models that we are studying. Our approach sets a new state-of-the-art on various dense correspondence tasks. We also demonstrate our effectiveness in keypoint correspondence and affordance map transfer.
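
For context, the nearest-neighbor baseline mentioned in the abstract can be sketched as below. This is a minimal illustration, not the authors' code: the function name `nearest_neighbor_matches`, the `(H, W, D)` feature layout, and the choice of cosine similarity are all assumptions made for this example.

```python
import numpy as np

def nearest_neighbor_matches(feat_src, feat_tgt):
    """Match every source grid cell to its most similar target grid cell.

    feat_src: (H, W, D) feature grid of the source image (assumed layout).
    feat_tgt: (Ht, Wt, D) feature grid of the target image.
    Returns an (H, W, 2) integer array of (row, col) target coordinates.
    """
    h, w, d = feat_src.shape
    _, wt, _ = feat_tgt.shape
    src = feat_src.reshape(-1, d)
    tgt = feat_tgt.reshape(-1, d)
    # Cosine similarity is an assumption; the abstract only says "nearest neighbors".
    src = src / (np.linalg.norm(src, axis=1, keepdims=True) + 1e-8)
    tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-8)
    sim = src @ tgt.T                      # (H*W, Ht*Wt) similarity matrix
    idx = sim.argmax(axis=1)               # best target cell per source cell
    rows, cols = np.divmod(idx, wt)        # flat index -> (row, col)
    return np.stack([rows, cols], axis=-1).reshape(h, w, 2)
```

Each source cell is matched independently here, so nothing forces neighboring cells to land on neighboring targets. That lack of global structure is what the functional map is meant to address.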

Paper Structure (38 sections, 14 equations, 6 figures, 8 tables)


Figures (6)

  • Figure 1: Overview. Left: Given two sets of features, $E^M, E^N$ and $F^M, F^N$, we compute the Laplacian eigenfunction basis with $E^M, E^N$, and apply regularizations to the functional map optimization using $F^M, F^N$. This method optimizes a mapping in the spectral domain derived from one feature set to achieve a consensus with the other set. Right: With a better understanding of the global image structure, our method produces smoother and more accurate correspondences in a zero-shot manner.
  • Figure 2: Eigenfunctions of the image Laplacian. We visualize the eigenfunctions of the graph Laplacian operator corresponding to the five smallest eigenvalues $\lambda_1,\cdots,\lambda_5$ (low frequency) as well as $\lambda_{10}, \lambda_{20}, \lambda_{50}$ (higher frequency).
  • Figure 3: Dense correspondences on SPair-71k (Min et al., 2019) image pairs. Each example displays pixel-wise mappings from source to target images in rainbow colors (second column for source coordinates, fourth and fifth columns for computed target coordinates) and color transfers (last two columns). Specifically, we show challenging examples, including significant viewpoint changes (first and second rows), shape variations (first and third rows), and occlusions (third row). Our framework achieves more consistent mappings thanks to its awareness of global structure.
  • Figure 4: Sparse keypoint correspondences on SPair-71k (Min et al., 2019) image pairs. Correct matches are connected with blue lines and incorrect matches with red lines.
  • Figure 5: Transferring tool affordances represented as heat maps. We treat affordance heat maps as functions defined on the source and the target image. After optimizing the functional map between the source and the target, we transfer the function by applying the functional map to it directly, following Eq. \ref{eq1}. We employ features from DINOv2-ViT-B/14 and Stable Diffusion to compute the functional maps in this experiment.
  • ...and 1 more figures
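
The captions of Figures 1, 2, and 5 describe three steps: build a Laplacian eigenbasis from one feature set, fit a functional map using the other feature set, and transfer a function (such as an affordance heat map) through that map. The sketch below illustrates these steps with NumPy only. It rests on several assumptions that are not taken from the paper: a dense Gaussian-affinity graph with bandwidth `sigma`, an unnormalized Laplacian, and a plain least-squares fit without the regularization terms the caption of Figure 1 mentions. All function names are hypothetical.

```python
import numpy as np

def laplacian_basis(feats, k, sigma=1.0):
    """Return the k smallest eigenvalues and eigenvectors of a graph Laplacian.

    feats: (n, d) per-patch features (the role played by E^M or E^N).
    The Gaussian affinity graph is an assumption made for this sketch.
    """
    sq = np.sum(feats**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma**2))     # pairwise affinities
    np.fill_diagonal(W, 0.0)               # no self-loops
    L = np.diag(W.sum(axis=1)) - W         # unnormalized Laplacian D - W
    evals, evecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    return evals[:k], evecs[:, :k]

def fit_functional_map(phi_m, phi_n, f_m, f_n):
    """Least-squares functional map C such that C @ A is close to B.

    A and B are the spectral coefficients of the second feature set
    (the role played by F^M and F^N). Regularizers are omitted here.
    """
    A = phi_m.T @ f_m                      # (k, d) source coefficients
    B = phi_n.T @ f_n                      # (k, d) target coefficients
    # Solve C A = B by solving the transposed system A^T C^T = B^T.
    C_T, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
    return C_T.T

def transfer_function(C, phi_m, phi_n, g_m):
    """Project g_m onto the source basis, map it with C, and reconstruct on the target."""
    return phi_n @ (C @ (phi_m.T @ g_m))
```

In this formulation the unknown is a small k-by-k matrix rather than a per-pixel assignment, which is why a single fitted map can move any function defined on the source image, including a heat map, to the target image.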