
VICRegL: Self-Supervised Learning of Local Visual Features

Adrien Bardes, Jean Ponce, Yann LeCun

TL;DR

The paper addresses the need for self-supervised learning methods that capture both global image semantics and local spatial structure without segmentation masks. It introduces VICRegL, a two-branch framework that adds local feature matching (location- and embedding-based) to the VICReg global objective, enabling simultaneous learning of global and local representations. Through extensive experiments on ImageNet, Pascal VOC, Cityscapes, and ADE20k, VICRegL demonstrates strong segmentation gains while preserving classification performance, with ConvNeXt backbones yielding notable improvements. The work presents a detailed ablation analysis and qualitative visualizations, highlighting the trade-off between local and global cues and the robustness of mask-free local matching.

Abstract

Most recent self-supervised methods for learning image representations focus on either producing a global feature with invariance properties, or producing a set of local features. The former works best for classification tasks while the latter is best for detection and segmentation tasks. This paper explores the fundamental trade-off between learning local and global features. A new method called VICRegL is proposed that learns good global and local features simultaneously, yielding excellent performance on detection and segmentation tasks while maintaining good performance on classification tasks. Concretely, two identical branches of a standard convolutional net architecture are fed two differently distorted versions of the same image. The VICReg criterion is applied to pairs of global feature vectors. Simultaneously, the VICReg criterion is applied to pairs of local feature vectors occurring before the last pooling layer. Two local feature vectors are attracted to each other if their l2-distance is below a threshold or if their relative locations are consistent with a known geometric transformation between the two input images. We demonstrate strong performance on linear classification and segmentation transfer tasks. Code and pretrained models are publicly available at: https://github.com/facebookresearch/VICRegL
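The two matching rules described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`cell_centers`, `match`), the `(x0, y0, width, height)` crop convention, and the threshold values are assumptions made for illustration, and the only geometric transformation handled is a crop (no flips or rescaling).

```python
import math

def cell_centers(crop, grid):
    """Centers of a grid x grid feature map, expressed in seed-image coordinates.

    `crop` is (x0, y0, width, height) of the view inside the seed image.
    This convention is hypothetical and chosen only for this sketch.
    """
    x0, y0, w, h = crop
    return [
        (x0 + (col + 0.5) * w / grid, y0 + (row + 0.5) * h / grid)
        for row in range(grid)
        for col in range(grid)
    ]

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def match(src, dst, threshold):
    """Pair each src vector with its nearest dst vector, if within threshold."""
    pairs = []
    for i, u in enumerate(src):
        j = min(range(len(dst)), key=lambda k: l2(u, dst[k]))
        if l2(u, dst[j]) <= threshold:
            pairs.append((i, j))
    return pairs

# Location-based matching: two overlapping crops of the same seed image.
view_a = cell_centers((0, 0, 100, 100), grid=2)
view_b = cell_centers((50, 0, 100, 100), grid=2)
location_pairs = match(view_a, view_b, threshold=1.0)

# Feature-based matching: the same rule applied to (toy) local embeddings.
emb_a = [[1.0, 0.0], [0.0, 1.0]]
emb_b = [[0.1, 0.9], [0.9, 0.1], [5.0, 5.0]]
feature_pairs = match(emb_a, emb_b, threshold=0.5)
```

The same `match` function serves both rules: applied to seed-image coordinates it pairs feature-map cells that cover the same region of the image, and applied to embeddings it pairs local features that are close in the embedding space, wherever they sit in the two views.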

Paper Structure (19 sections, 6 equations, 5 figures, 11 tables, 1 algorithm)

Figures (5)

  • Figure 1: Overview of VICRegL: learning local and global features with VICReg. Given a seed image, two views are produced and fed to an encoder that produces local features. The features are further processed by a local projector that embeds them into a smaller space without destroying the localization information. Two matchings are computed, one based on the spatial information provided by the transformation between the views and the other based on the $l^2$-distance in the embedding space, and the VICReg criterion is then applied between matched spatial embeddings. Additionally, the local features from the encoder are pooled together, and the pooled features are fed to a global expander. The VICReg criterion is finally applied between the two resulting embeddings.
  • Figure 2: Study of the trade-off between local and global criteria. Linear classification on ImageNet and linear segmentation on Pascal VOC for VICRegL pretrained with various values of the $\alpha$ coefficient of Eq. (\ref{eq:loss}), which controls the importance of the global criterion relative to the local criterion.
  • Figure 3: Selected matches: visualization of the locations of the best local matches selected by VICRegL. The left image is the seed image, with the crop locations for the two views shown in red and blue. The left column shows the feature-based matches and the right column shows the location-based matches. Only 10 matches are visualized for clarity; the actual number of selected matches is 20. Matches are displayed according to the location of the feature vectors in the feature maps. Note that the receptive field of each feature vector is much larger than the patch represented by one square of the grid in the figure. Best viewed in color with zoom.
  • Figure 4: Selected matches: visualization of the locations of the best local matches selected by VICRegL. The left image is the seed image, with the crop locations for the two views shown in red and blue. The left column shows the $l^2$-distance-based matches and the right column shows the location-based matches. Only 10 matches are visualized for clarity; the actual number of selected matches is 20. Matches are displayed according to the location of the feature vectors in the feature maps. Note that the receptive field of each feature vector is much larger than the patch represented by one square of the grid in the figure. Best viewed in color with zoom.
  • Figure 5: Selected matches: visualization of the locations of the best local matches selected by VICRegL. The left image is the seed image, with the crop locations for the two views shown in red and blue. The left column shows the $l^2$-distance-based matches and the right column shows the location-based matches. Only 10 matches are visualized for clarity; the actual number of selected matches is 20. Matches are displayed according to the location of the feature vectors in the feature maps. Note that the receptive field of each feature vector is much larger than the patch represented by one square of the grid in the figure. Best viewed in color with zoom.
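
Two ideas from the captions above lend themselves to a short sketch: keeping only a fixed number of best matches (20 in Figures 3 to 5) and weighting the global criterion against the local one with $\alpha$ (Figure 2). The function names are hypothetical, and the convex combination `alpha * global + (1 - alpha) * local` is an assumed form; the captions only state that $\alpha$ controls the importance of the global criterion relative to the local one.

```python
def top_k_matches(scored_pairs, k):
    """Keep the k candidate matches with the smallest distance.

    `scored_pairs` holds (index_in_view_1, index_in_view_2, distance) tuples.
    """
    return sorted(scored_pairs, key=lambda pair: pair[2])[:k]

def combined_loss(global_loss, local_loss, alpha):
    """Weight the global criterion by alpha and the local one by 1 - alpha.

    The convex combination is an assumption made for this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * global_loss + (1.0 - alpha) * local_loss

# Toy candidates: three matches, of which the two closest are kept.
candidates = [(0, 1, 0.9), (1, 0, 0.1), (2, 2, 0.5)]
selected = top_k_matches(candidates, k=2)

# alpha = 1 uses only the global criterion, alpha = 0 only the local one.
loss = combined_loss(global_loss=2.0, local_loss=4.0, alpha=0.75)
```

Under this assumed form, sweeping `alpha` between 0 and 1 moves the objective from purely local to purely global, which is the axis along which Figure 2 reports classification and segmentation results.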