Distillation of Diffusion Features for Semantic Correspondence

Frank Fundel, Johannes Schusterbauer, Vincent Tao Hu, Björn Ommer

TL;DR

The paper tackles semantic correspondence by distilling the complementary representations of two large vision foundation models into a single, parameter-efficient student using LoRA. It introduces an unsupervised 3D data augmentation pipeline based on multi-view depth information to fine-tune the distilled model without labeled data. The approach achieves state-of-the-art performance on standard benchmarks while delivering substantially higher throughput and fewer parameters, enabling real-time applications such as semantic video correspondence. Overall, multi-teacher distillation with 3D augmentation yields a practical, high-accuracy solution for semantic alignment under constrained compute budgets.
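To make the parameter-efficiency claim concrete, here is a minimal sketch of how few trainable parameters LoRA adds to a single linear layer. It relies only on the standard LoRA formulation, in which the frozen weight W is augmented by a low-rank update B·A. The ranks 4, 8, 16 and 32 are the ones ablated in the paper's Figure 5; the layer width of 768 and the function names are hypothetical, chosen for illustration, and are not taken from the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer.

    LoRA keeps the original weight W frozen and learns a low-rank update
    B @ A, where A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters of the frozen dense weight, for comparison."""
    return d_in * d_out


if __name__ == "__main__":
    d = 768  # hypothetical hidden width, not taken from the paper
    for rank in (4, 8, 16, 32):  # ranks ablated in Figure 5
        added = lora_trainable_params(d, d, rank)
        share = added / full_params(d, d)
        print(f"rank={rank:2d}  trainable={added:6d}  share of dense layer={share:.2%}")
```

The count grows linearly with the rank, so doubling the rank doubles the number of trainable parameters per adapted layer. How many layers are adapted in the distilled model is not stated here and is left out of the sketch.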

Abstract

Semantic correspondence, the task of determining relationships between different parts of images, underpins various applications including 3D reconstruction, image-to-image translation, object tracking, and visual place recognition. Recent studies have begun to explore representations learned in large generative image models for semantic correspondence, demonstrating promising results. Building on this progress, current state-of-the-art methods rely on combining multiple large models, resulting in high computational demands and reduced efficiency. In this work, we address this challenge with a novel knowledge distillation technique. We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost. Furthermore, we demonstrate that incorporating 3D data further improves performance without the need for human-annotated correspondences. Overall, our empirical results demonstrate that our distilled model with 3D data augmentation outperforms current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence. Our code and weights are publicly available on our project page.
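As an illustration of the task itself, the sketch below shows one common way to read correspondences out of dense features: for a query point in the source image, pick the target location whose feature vector has the highest cosine similarity. This is an assumption about the matching step, not a description of the authors' implementation, and the function names and toy feature grids are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_point(src_feats, tgt_feats, src_xy):
    """Return the target (x, y) whose feature best matches the source point.

    src_feats and tgt_feats are H x W grids (nested lists) holding one
    feature vector per location; src_xy is an (x, y) grid coordinate.
    """
    x, y = src_xy
    query = src_feats[y][x]
    best_xy, best_sim = None, -math.inf
    for ty, row in enumerate(tgt_feats):
        for tx, feat in enumerate(row):
            sim = cosine(query, feat)
            if sim > best_sim:
                best_xy, best_sim = (tx, ty), sim
    return best_xy, best_sim


if __name__ == "__main__":
    # Toy 2 x 2 feature grids with 2-dimensional features (illustrative only).
    src = [[[1, 0], [0, 1]], [[1, 1], [-1, 0]]]
    tgt = [[[0, 2], [-3, 0]], [[2, 0], [1, 1]]]
    print(match_point(src, tgt, (0, 0)))
```

In a real pipeline the grids would come from the feature extractor, and the exhaustive loop would be replaced by a batched matrix product. Every query requires one feature extraction per image, which is why the cost of the feature extractor dominates throughput.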

Paper Structure

This paper contains 14 sections, 7 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 2: Illustration of our multi-teacher distillation framework (a) and 3D data augmentation method (b). We distill two complementary models, DINOv2 and SDXL Turbo, into a single, more efficient model. Using unsupervised 3D data augmentation, we further refine our distilled model to achieve a new state of the art in both throughput and performance.
  • Figure 3: Example image pairs from SPair-71k with predicted correspondences of different methods. Green indicates correct, while red indicates incorrect according to $\text{PCK}_{\text{bbox}}$@0.1. $(840 \times 840)$ was used as input resolution for DINOv2.
  • Figure 4: Video semantic correspondence sample, showing accurate correspondences at a high frame rate. We use source points on the first frame to calculate the corresponding points on all other frames at almost 30 FPS on an NVIDIA A100 80GB.
  • Figure 5: Ablation of the rank parameter of LoRA, evaluated on SPair-71k. Trained for 20 epochs on COCO with retrieval sampling. The #Params correspond to the ranks: 4, 8, 16 and 32, respectively.
  • Figure 6: Zero-shot foreground-background differentiation using k-means. Our distilled model produces segmentation masks with less noisy edges.
  • ...and 3 more figures
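
The Figure 3 caption colors keypoints by $\text{PCK}_{\text{bbox}}$@0.1 without defining it. The sketch below assumes the usual definition of the Percentage of Correct Keypoints: a prediction counts as correct when it lies within $\alpha \cdot \max(h, w)$ of the ground-truth keypoint, where $h$ and $w$ are the height and width of the object bounding box and $\alpha = 0.1$. The function name and the sample coordinates are hypothetical.

```python
import math


def pck_bbox(pred, gt, bbox, alpha=0.1):
    """Percentage of Correct Keypoints with a bounding-box-relative threshold.

    pred, gt: equally long lists of (x, y) keypoints in pixels.
    bbox: (x_min, y_min, x_max, y_max) of the object in the target image.
    Returns the fraction of correct keypoints and the per-keypoint flags.
    """
    if not pred or len(pred) != len(gt):
        raise ValueError("pred and gt must be non-empty and of equal length")
    x_min, y_min, x_max, y_max = bbox
    threshold = alpha * max(x_max - x_min, y_max - y_min)
    correct = [math.dist(p, g) <= threshold for p, g in zip(pred, gt)]
    return sum(correct) / len(correct), correct


if __name__ == "__main__":
    # Illustrative values only: a 100 x 50 box gives a 10-pixel threshold.
    pred = [(10, 10), (50, 40), (90, 20)]
    gt = [(13, 14), (50, 55), (90, 20)]
    score, flags = pck_bbox(pred, gt, bbox=(0, 0, 100, 50))
    print(score, flags)
```

Under this reading, the per-keypoint flags correspond to the green (correct) and red (incorrect) markers in the figure.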