What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers

Moritz Pawlowsky, Antonis Vamvakeros, Alexander Weiss, Anja Bielefeld, Samuel J. Cooper, Ronan Docherty

Abstract

Vision transformers (ViTs) - especially feature foundation models like DINOv2 - learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding) can lead to these models displaying positional biases and artefacts independent of semantic content. This makes zero-shot adaptation difficult in fields like materials science, where images are often cross-sections of homogeneous microstructure (i.e. having no preferred direction). In this work, we investigate the positional bias in ViTs via linear probing, finding it present across a range of objectives and positional encodings, and subsequently reduce it by finetuning models to use ALiBi relative positional encoding. We demonstrate that these models retain desirable general semantics and that their unbiased features can be used successfully in trainable segmentation of complex microscopy images.
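
As a rough illustration of what an ALiBi-style relative positional encoding on a 2D token grid can look like, the sketch below builds a per-head additive attention bias that depends only on the distance between tokens. It is a minimal sketch under stated assumptions, not the paper's implementation: the Euclidean distance, the head slopes borrowed from the original 1D ALiBi formulation (exact only for power-of-two head counts), the omission of CLS/register tokens, and all function names are assumptions made for illustration.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Geometric head slopes 2^(-8k/n_heads), k = 1..n_heads (1D ALiBi, power-of-two case)."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_2d_bias(h, w, n_heads):
    """Additive attention bias of shape (n_heads, h*w, h*w) for an h x w token grid."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean token distance (assumption)
    # Each head penalises distant tokens with its own slope.
    return -alibi_slopes(n_heads)[:, None, None] * dist[None]

def attention_with_bias(q, k, v, bias):
    """Scaled dot-product attention with the bias added to the logits before softmax."""
    d = q.shape[-1]
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(d) + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy example: 4 heads on a 4 x 4 grid of tokens with 8-dimensional heads.
rng = np.random.default_rng(0)
n_heads, grid_h, grid_w, head_dim = 4, 4, 4, 8
bias = alibi_2d_bias(grid_h, grid_w, n_heads)
q, k, v = (rng.normal(size=(n_heads, grid_h * grid_w, head_dim)) for _ in range(3))
out = attention_with_bias(q, k, v, bias)
```

Because the bias is a function of relative offsets only, the same pair offset receives the same penalty anywhere in the image, which is the property that distinguishes it from a learned absolute positional embedding added once at the input.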

Paper Structure (42 sections, 2 equations, 23 figures, 6 tables)

Figures (23)

  • Figure 1: Summary of our contribution. (a) DINOv2's learned positional encoding (PE) leads to positionally biased features, which cause poor results in zero-shot segmentation of out-of-distribution images. (b) We remove the learned PE of a trained DINOv2 checkpoint, add 2D ALiBi PEs (based on relative token distances at each attention layer) and finetune with the original embeddings as the target. (c) This produces a model with more homogeneous features and better resulting segmentations.
  • Figure 2: Linear probe analysis of DINOv2-S features. (a) We train linear probes to map from image features (or individual channels) to ramp functions on randomly sampled regions (red squares), reporting $R^2$ scores on holdout regions. Per-channel scores and predictions (which use all channels) are both averaged over a dataset of 15 homogeneous microscopy images, i.e. images with no preferred direction. (b) The channels with the highest positional $R^2$ scores for a series of images, including a satellite image and an electron microscope image of a nickel superalloy [NI_SUPERALLOY].
  • Figure 3: Per-channel per-layer 'positional fingerprint' of $R^2$ scores for DINOv2, DINOv3 and ALiBi-Dv2 for a left-right target ramp. DINOv2 begins with positional information spread across channels (its learned PE is added at the start of the network), which later decreases, whereas for DINOv3 the channels become more positional with layer depth (RoPE is applied at each layer). ALiBi-Dv2 has less positional information present across its channels and layers.
  • Figure 4: Feature PCA comparisons for DINOv2-S, DVT and our ALiBi-Dv2. ALiBi-Dv2 produces features which are semantically rich but which display less positional bias and fewer artefacts. The token geometry is (locally) smooth, and in many cases similar to that of DVT. Of note are the 'square-circle' image, which retains 'objectness' across indistinguishable interior patches, the preserved vertical bias of the dog image (from depth-of-field) and the reduced positional biases on the X-ray CT cross-section of a pin-type Li-ion battery [BIL].
  • Figure 5: Feature PCA visualisations for DINOv2, DINOv3 and our ALiBi-Dv2. We retain desirable object decompositions (e.g. head vs body of the dogs), whilst having fewer positional features for the satellite image and the SEM images of a battery cathode [BIL] and biphase steel [BIPHASE_STEEL].
  • ...and 18 more figures
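
The linear-probe analysis described in the Figure 2 caption can be sketched roughly as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the features are synthetic (one channel deliberately leaks the horizontal coordinate), the probe is an ordinary least-squares fit, and the function names, the number and size of training squares, and the grid size are all hypothetical choices made for illustration.

```python
import numpy as np

def ramp_target(h, w):
    """Left-right ramp in [0, 1] over an h x w token grid."""
    return np.tile(np.linspace(0.0, 1.0, w), (h, 1))

def sample_square_mask(h, w, n_squares, size, rng):
    """Boolean mask marking randomly placed training squares."""
    mask = np.zeros((h, w), dtype=bool)
    for _ in range(n_squares):
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        mask[top:top + size, left:left + size] = True
    return mask

def probe_r2(features, target, train_mask):
    """Fit a least-squares linear probe on masked tokens; return R^2 on the holdout tokens."""
    h, w, c = features.shape
    x = np.hstack([features.reshape(h * w, c), np.ones((h * w, 1))])  # add bias column
    y = target.reshape(h * w)
    train = train_mask.reshape(h * w)
    coef, *_ = np.linalg.lstsq(x[train], y[train], rcond=None)
    pred = x[~train] @ coef
    ss_res = np.sum((y[~train] - pred) ** 2)
    ss_tot = np.sum((y[~train] - y[~train].mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
h, w, c = 32, 32, 8
target = ramp_target(h, w)
train_mask = sample_square_mask(h, w, n_squares=6, size=8, rng=rng)

# Synthetic stand-ins for ViT features: pure noise vs. noise with one positional channel.
unbiased = rng.normal(size=(h, w, c))
biased = unbiased.copy()
biased[..., 0] = target + 0.05 * rng.normal(size=(h, w))

r2_unbiased = probe_r2(unbiased, target, train_mask)
r2_biased = probe_r2(biased, target, train_mask)
# Per-channel scores: probe each channel on its own.
per_channel = [probe_r2(biased[..., i:i + 1], target, train_mask) for i in range(c)]
```

A high holdout $R^2$ indicates that position can be read out linearly from the features; probing one channel at a time, as in `per_channel`, gives the kind of per-channel score that the positional fingerprints summarise across layers.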