Patch-Level Kernel Alignment for Dense Self-Supervised Learning

Juan Yeo, Ijun Jang, Taesup Kim

TL;DR

The paper addresses the challenge of learning high-quality dense representations in vision models without parametric distribution assumptions. It introduces Patch-level Kernel Alignment (PaKA), a non-parametric post-(pre)training method that uses Centered Kernel Alignment to align the patch-level relational structure between teacher and student representations, avoiding clustering or memory banks. PaKA is complemented by two augmentation refinements—Global–Local Intersection Maximization and an augmentation-free teacher—to preserve spatial information and stabilize targets. Empirically, PaKA achieves state-of-the-art performance on multiple dense-vision benchmarks (VOC2012, ADE20K, COCO-derived tasks) with only about 14 hours of single-GPU training, and it generalizes across ViT backbones, offering substantial efficiency and accuracy gains over prior methods.

Abstract

Dense self-supervised learning (SSL) methods have proven effective at enhancing the fine-grained semantic understanding of vision models. However, existing approaches often rely on parametric assumptions or complex post-processing (e.g., clustering, sorting), which limits their flexibility and stability. To overcome these limitations, we introduce Patch-level Kernel Alignment (PaKA), a non-parametric, kernel-based approach that improves the dense representations of pretrained vision encoders through post-(pre)training. Our method proposes a robust and effective alignment objective that captures statistical dependencies matching the intrinsic structure of high-dimensional dense feature distributions. In addition, we revisit the augmentation strategies inherited from image-level SSL and propose a refined augmentation strategy for dense SSL. Our framework improves dense representations by running a lightweight post-training stage on top of a pretrained model. With only 14 hours of additional training on a single GPU, our method achieves state-of-the-art performance across a range of dense vision benchmarks, demonstrating both efficiency and effectiveness.
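
To make the alignment objective concrete, the following is a minimal sketch of a Centered Kernel Alignment (CKA) loss between student and teacher patch features. It is an illustration under stated assumptions, not the paper's implementation: the linear kernel, the `1 - CKA` loss form, and the function names `center_gram`, `linear_cka`, and `cka_alignment_loss` are choices made here, and NumPy stands in for whatever training framework the authors use.

```python
import numpy as np


def center_gram(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def linear_cka(X, Y):
    """Linear CKA between two feature sets describing the same n patches.

    X has shape (n, d1) and Y has shape (n, d2); row i of both matrices
    must correspond to the same patch. The result lies in [0, 1].
    """
    Kc = center_gram(X @ X.T)  # centered patch-to-patch similarities for X
    Lc = center_gram(Y @ Y.T)  # centered patch-to-patch similarities for Y
    cross = np.sum(Kc * Lc)    # equals trace(Kc @ Lc) for symmetric matrices
    norm = np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc))
    return cross / norm


def cka_alignment_loss(student, teacher):
    """Assumed loss form: 1 - CKA, minimized when the structures agree."""
    return 1.0 - linear_cka(student, teacher)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(16, 8))   # 16 patches, 8-dim features
    student = rng.normal(size=(16, 12))  # feature widths may differ
    print("loss vs. unrelated features:", cka_alignment_loss(student, teacher))
    print("loss vs. identical features:", cka_alignment_loss(teacher, teacher))
```

Because CKA compares centered and normalized Gram matrices, the score is unchanged by isotropic rescaling or orthogonal rotation of either feature set, so the loss depends only on how patches relate to one another and not on the absolute feature coordinates.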

Paper Structure

This paper contains 55 sections, 5 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Overview of Patch-level Kernel Alignment (PaKA) for Dense Post-(pre)training. PaKA is a student-teacher framework that aligns dense patch representations by comparing their relational structures, enabling the student to capture the teacher’s fine-grained feature relationships without requiring complex algorithms or memory banks.
  • Figure 2: Why CKA leads to better alignment than Gram-matrix in dense post-(pre)training. (a) 2D PCA projection of 5,000 teacher and student dense representations. (b) Normalized training loss curves for the Gram-matrix and CKA losses. Complete training loss curves can be found in the Appendix. (c) Overclustering performance on ADE20K.
  • Figure 3: Our proposed augmentation strategies and their empirical validation. (a) Conceptual overview of maximizing view intersection and employing a clean teacher. (b) Performance, measured as mIoU in an overclustering task (K=500) on Pascal VOC 2012, significantly improves as the minimum intersection ratio between views is increased. (c) Student model performance peaks when teacher augmentation strength is minimized. Detailed experimental results are provided in the Appendix.
  • Figure 4: Visualization of Overclustering. Results of Overclustering for DINOv2R, NeCo, and PaKA on Pascal VOC. Colored overlays represent matched semantic clusters derived via K-means.
  • Figure 5: Visualization of Vision In-Context Learning. This figure contrasts the top five nearest neighbors retrieved by PaKA versus DINOv2R on Pascal VOC. PaKA consistently finds more semantically relevant and precise patches, including specific object parts.
  • ...and 1 more figures
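
The minimum intersection ratio between views mentioned in Figure 3(b) can be illustrated with a small rejection-sampling sketch. Everything specific here is an assumption made for illustration: the ratio is taken to be the overlap area divided by the local crop's area, crops are square boxes in unit image coordinates, and the scale ranges and the names `intersection_ratio`, `sample_box`, and `sample_views` are hypothetical rather than taken from the paper.

```python
import random


def intersection_ratio(a, b):
    """Overlap area of boxes a and b divided by the area of b.

    Boxes are (x0, y0, x1, y1). This definition of the ratio is an
    assumption; the paper may normalize differently.
    """
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (w * h) / area_b


def sample_box(rng, area_min, area_max):
    """Sample a square crop inside the unit image with the given area range."""
    side = rng.uniform(area_min, area_max) ** 0.5
    x0 = rng.uniform(0.0, 1.0 - side)
    y0 = rng.uniform(0.0, 1.0 - side)
    return (x0, y0, x0 + side, y0 + side)


def sample_views(min_ratio, rng, max_tries=1000):
    """Draw a global crop, then resample local crops until they overlap enough."""
    global_box = sample_box(rng, 0.4, 1.0)       # hypothetical scale range
    for _ in range(max_tries):
        local_box = sample_box(rng, 0.05, 0.4)   # hypothetical scale range
        if intersection_ratio(global_box, local_box) >= min_ratio:
            return global_box, local_box
    raise RuntimeError("no local crop met the minimum intersection ratio")


if __name__ == "__main__":
    g, l = sample_views(min_ratio=0.5, rng=random.Random(0))
    print("global:", g)
    print("local: ", l)
    print("ratio: ", intersection_ratio(g, l))
```

Raising `min_ratio` forces the local view to share more of its area with the global view, which is the quantity varied along the horizontal axis in Figure 3(b).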