Near, far: Patch-ordering enhances vision foundation models' scene understanding

Valentinos Pariza; Mohammadreza Salehi; Gertjan Burghouts; Francesco Locatello; Yuki M. Asano

TL;DR

This work tackles the limitation of binary learning signals in self-supervised vision models by introducing NeCo, a dense post-pretraining objective that enforces Patch Neighbor Consistency via differentiable sorting of patch-level distances. Built on a teacher-student ViT framework with ROI Align, NeCo aligns and orders patch neighborhoods across two views to yield a fine-grained supervision signal. It delivers large, consistent gains across dense tasks, including in-context semantic segmentation, linear segmentation, end-to-end segmentation, 3D multiview consistency, and even vision-language transfer, often surpassing prior state-of-the-art methods with modest compute (~19 GPU-hours). The findings underscore the value of preserving and aligning patch-level semantics and ordering relationships, suggesting broad applicability to dense representations in vision foundation models and downstream tasks.

Abstract

We introduce NeCo: Patch Neighbor Consistency, a novel self-supervised training loss that enforces patch-level nearest neighbor consistency across a student and teacher model. Compared to contrastive approaches that only yield binary learning signals, i.e., 'attract' and 'repel', this approach benefits from the more fine-grained learning signal of sorting spatially dense features relative to reference patches. Our method applies differentiable sorting on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal and further improve upon them. This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU. The method generates high-quality dense feature encoders and establishes several new state-of-the-art results: +5.5% and +6% for non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, +7.2% and +5.7% for linear segmentation evaluations on COCO-Things and COCO-Stuff, and an improvement of more than 1.5% in 3D understanding, measured as multi-view consistency on SPair-71k.
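
To make the contrast between a binary signal and an ordering signal concrete, the following is a minimal sketch of the idea, not the authors' implementation. It uses a simple sigmoid-based soft rank as a stand-in for the differentiable sorting operator; the function names, the toy 2-D features, and the `temperature` value are all assumptions made for illustration.

```python
import math

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def soft_ranks(distances, temperature=0.05):
    """Smooth approximation of each distance's rank (0 = nearest).

    Assumption: a pairwise-sigmoid relaxation, used here only as a
    stand-in for a differentiable sorting operator.
    """
    ranks = []
    for i, d_i in enumerate(distances):
        rank = 0.0
        for j, d_j in enumerate(distances):
            if j != i:
                # Close to 1 when d_j < d_i, close to 0 otherwise.
                rank += 1.0 / (1.0 + math.exp(-(d_i - d_j) / temperature))
        ranks.append(rank)
    return ranks

def ordering_mismatch(student_patch, teacher_patch, references, temperature=0.05):
    """Mean squared gap between student and teacher soft ranks of the references."""
    d_student = [cosine_distance(student_patch, r) for r in references]
    d_teacher = [cosine_distance(teacher_patch, r) for r in references]
    r_student = soft_ranks(d_student, temperature)
    r_teacher = soft_ranks(d_teacher, temperature)
    return sum((a - b) ** 2 for a, b in zip(r_student, r_teacher)) / len(references)

# Toy reference patches and one teacher patch (hypothetical 2-D features).
references = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
teacher = [1.0, 0.1]
student_good = [0.9, 0.2]   # ranks the references in the same order as the teacher
student_bad = [-1.0, 0.1]   # ranks the references in the reverse order

good = ordering_mismatch(student_good, teacher, references)
bad = ordering_mismatch(student_bad, teacher, references)
```

In this toy setup the mismatch grows with how far the student's neighbor ordering departs from the teacher's, which is a graded signal rather than a single attract-or-repel decision per pair.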

Paper Structure (47 sections, 6 equations, 7 figures, 17 tables)

Introduction
Related Works
Patch Neighbor Consistency
Feature Extraction and Alignment.
Pairwise Distance Computation.
Differentiable Sorting of Distances.
Training Loss.
Experiments
Setup
Comparison to State-of-the-Art
Frozen Clustering-based Evaluations.
Ablation Studies
Conclusion
Experimental Setup
Dense Post-Pretraining
...and 32 more sections

Figures (7)

Figure 1: NeCo overview. Given an input image $I$, two augmentations $\tau_1$ and $\tau_2$ are applied to create two different views, which are processed by the teacher and student encoders, $\phi_t$ and $\phi_s$ respectively. The teacher encoder is updated using Exponential Moving Average (EMA). The encoded features are then aligned using ROI Align and compared with reference features $F_r$ obtained by applying $\phi_t$ to other batch images. Next, pairwise distances $D_{ij}$ between $F_s$ and $F_r$, as well as between $F_t$ and $F_r$, are computed using cosine similarity. These distances are then sorted using differentiable sorting and used to enforce nearest-neighbor order consistency across the views through the NeCo loss.
Figure 2: In-context scene understanding benchmark. Dense nearest neighbor retrieval performance is reported across various training data proportions on two scene-centric datasets, Pascal VOC and ADE20k. The retrieved cluster maps are compared with the ground truth using Hungarian matching (kuang2021video), and their mIoU score is reported. For all models, ViT-B16 is used except for DINOv2R and NeCo, where it is ViT-B14. For full tables, refer to the appendix.
Figure 3: Pascal VOC visualizations. We overlay the ground truth on top of a subset of images in Pascal VOC. These images and their ground truth segmentation maps are used for our tasks, such as visual in-context learning and linear segmentation.
Figure 4: Nearest patch retrieval. Comparison of nearest neighbor retrieval results between NeCo and DINOv2R on Pascal VOC. For each query patch, NeCo retrieves more relevant and precise nearest patches, accurately identifying patches within the same object and object parts.
Figure 5: Borderline cases. NeCo sometimes retrieves patches of similar parts from different objects. For example, a patch from a bicycle wheel might be matched with a motorcycle wheel. Additionally, since we rely on cropping to induce nearest neighbor similarity, small objects in the input, which may not significantly affect the overall semantics, can alter the semantics at the patch level, leading to unexpected nearest neighbors, as seen in the case of the sheep photo.
...and 2 more figures
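
The caption of Figure 1 states that the teacher encoder is updated with an Exponential Moving Average (EMA) of the student. A minimal sketch of such an update is given below, assuming parameters are stored as plain lists of floats keyed by name; the `ema_update` helper and the `momentum` value are hypothetical and are not taken from the paper.

```python
def ema_update(teacher_params, student_params, momentum=0.99):
    """In-place EMA update: teacher <- momentum * teacher + (1 - momentum) * student.

    Assumption: both arguments are dicts mapping a parameter name to a
    list of floats of equal length. The momentum value is illustrative.
    """
    for name, student_values in student_params.items():
        teacher_values = teacher_params[name]
        for i, s in enumerate(student_values):
            teacher_values[i] = momentum * teacher_values[i] + (1.0 - momentum) * s
    return teacher_params

# Toy parameters: the teacher moves a small step toward the student.
teacher_params = {"w": [0.0, 1.0]}
student_params = {"w": [1.0, 1.0]}
ema_update(teacher_params, student_params, momentum=0.9)
```

With a momentum close to 1 the teacher changes slowly, so it serves as a more stable target than the student it is averaged from.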
