Another BRIXEL in the Wall: Towards Cheaper Dense Features

Alexander Lappe, Martin A. Giese

TL;DR

BRIXEL tackles the challenge of producing high-resolution dense feature maps from vision transformers without prohibitive compute. It uses a simple self-distillation setup: a frozen teacher processes the image at high resolution, while a student with the same frozen backbone receives a downsampled copy and learns, through a trainable refiner and a convolutional head, to reproduce the teacher's dense features. Training minimizes a combination of $L_1$, edge, and spectral losses: $\mathcal{L}_{\text{total}}(\boldsymbol{\theta}) = \mathcal{L}_1(\boldsymbol{\theta}) + \lambda_{\text{edge}} \mathcal{L}_{\text{edge}}(\boldsymbol{\theta}) + \lambda_{\text{spectral}} \mathcal{L}_{\text{spectral}}(\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ denotes the trainable parameters. Across multiple DINOv3 backbones and 110k high-resolution training images, BRIXEL consistently improves performance on both scene- and object-centric dense tasks at fixed input resolution, while requiring far less compute than full high-resolution inference. The method also extends to higher-density regimes and to alternative backbones such as SigLIP 2, suggesting broad applicability. By making high-quality dense features affordable, BRIXEL informs future dense-vision pretraining and efficient deployment of vision foundation models.
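
To make the structure of the combined objective concrete, here is a minimal NumPy sketch. Only the weighted sum itself comes from the text above. The definitions of the edge term (finite differences) and the spectral term (FFT magnitudes), the default weights, and all function names are assumptions chosen for illustration, and may differ from what the authors use.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between two feature maps of shape (C, H, W)."""
    return float(np.mean(np.abs(pred - target)))

def edge_loss(pred, target):
    """ASSUMED edge term: L1 distance between spatial finite differences."""
    d_rows = np.diff(pred, axis=1) - np.diff(target, axis=1)
    d_cols = np.diff(pred, axis=2) - np.diff(target, axis=2)
    return float(np.mean(np.abs(d_rows)) + np.mean(np.abs(d_cols)))

def spectral_loss(pred, target):
    """ASSUMED spectral term: L1 distance between 2D FFT magnitudes."""
    mag_pred = np.abs(np.fft.fft2(pred, axes=(1, 2)))
    mag_target = np.abs(np.fft.fft2(target, axes=(1, 2)))
    return float(np.mean(np.abs(mag_pred - mag_target)))

def total_loss(pred, target, lambda_edge=1.0, lambda_spectral=1.0):
    """L_total = L_1 + lambda_edge * L_edge + lambda_spectral * L_spectral.

    The default weights are placeholders, not values from the paper.
    """
    return (
        l1_loss(pred, target)
        + lambda_edge * edge_loss(pred, target)
        + lambda_spectral * spectral_loss(pred, target)
    )
```

Here `pred` would be the student's upsampled feature map and `target` the teacher's high-resolution map. In actual training the same computation would be written in an autodiff framework so that gradients reach the trainable parameters.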

Abstract

Vision foundation models achieve strong performance on both global and locally dense downstream tasks. Pretrained on large images, the recent DINOv3 model family is able to produce very fine-grained dense feature maps, enabling state-of-the-art performance. However, computing these feature maps requires the input image to be available at very high resolution, as well as large amounts of compute due to the quadratic complexity of the transformer architecture. To address these issues, we propose BRIXEL, a simple knowledge distillation approach that has the student learn to reproduce its own feature maps at higher resolution. Despite its simplicity, BRIXEL outperforms the baseline DINOv3 models by large margins on downstream tasks when the resolution is kept fixed. Moreover, it is able to produce feature maps that are very similar to those of the teacher at a fraction of the computational cost. Code and model weights are available at https://github.com/alexanderlappe/BRIXEL.
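
The quadratic scaling mentioned in the abstract can be illustrated with a short back-of-the-envelope calculation. The sketch below assumes a patch size of 16 pixels, which is consistent with the figure captions (1024 pixels per side yielding a 64x64 feature map). It counts only pairwise attention interactions, and the function names are hypothetical. It ignores the parts of a transformer that scale linearly in the token count, any extra tokens such as class or register tokens, and the cost of BRIXEL's own trainable components, so it should not be read as a runtime prediction.

```python
def num_tokens(image_side: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image (patch size is an assumption)."""
    if image_side % patch_size != 0:
        raise ValueError("image_side must be divisible by patch_size")
    per_side = image_side // patch_size
    return per_side * per_side

def attention_pairs(image_side: int, patch_size: int = 16) -> int:
    """Token-pair interactions in one self-attention layer: N squared."""
    n = num_tokens(image_side, patch_size)
    return n * n

high_res_tokens = num_tokens(1024)  # 64 * 64 = 4096 tokens
low_res_tokens = num_tokens(256)    # 16 * 16 = 256 tokens

# 16x fewer tokens means 16**2 = 256x fewer attention pairs.
pair_ratio = attention_pairs(1024) // attention_pairs(256)
print(high_res_tokens, low_res_tokens, pair_ratio)  # 4096 256 256
```

Reducing the input side length by a factor of 4 cuts the token count by 16 and the attention term by 256, which is why feeding the student a downsampled image is so much cheaper than running the backbone at full resolution.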

Paper Structure

This paper contains 20 sections, 5 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Recent dense feature extractors are able to operate at very high resolution, albeit at great computational cost. We propose BRIXEL, a simple self-distillation approach that produces dense feature maps while circumventing the Vision Transformer's quadratic scaling.
  • Figure 2: An overview of BRIXEL. The teacher and student network share both architecture and weights, which are all frozen. During training, the student receives a downsampled input image and has to reconstruct the dense features computed by the high-resolution teacher model. To do so, the student is connected to a standard ViT adapter and feeds into a convolutional readout head which fuses the output of the frozen student backbone and the trainable ViT adapter.
  • Figure 3: Qualitative evaluation of the proposed method. The second and third columns display the dense feature maps of DINOv3 when feeding in the input image at different resolutions. At 256 pixels per side, feature maps become very blurry. The final column shows the dense feature maps of DINOv3 when combined with BRIXEL. Even though the input images are also 256 pixels per side, the feature maps are visually almost indistinguishable from those computed with 4096 tokens at considerably higher computational cost. As has become standard practice, we create the visualizations by performing a singular value decomposition on the 4096 tokens of the high-resolution target feature map. Then, we project all tokens of all images in the same row onto the first three singular vectors and map the results to RGB values.
  • Figure 4: We compare the computational cost of generating dense features of size 64x64 for a single image using the baseline DINOv3 (1024 pixels) and the proposed method (256 pixels). Runtime is measured on an NVIDIA A100.
  • Figure 5: Feature maps of the ViT-B model fine-tuned and evaluated at an image size of 480x480. Best viewed on screen using zoom.
  • ...and 3 more figures
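
The visualization procedure described in the Figure 3 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' plotting code: the function names are hypothetical, and the per-channel min-max scaling used to reach RGB is an assumption, since the caption does not say how projections are normalized.

```python
import numpy as np

def fit_rgb_basis(target_tokens):
    """Return the first three right singular vectors of an (N, D) token matrix.

    target_tokens would be the 4096 tokens of the high-resolution target map.
    """
    _, _, vt = np.linalg.svd(target_tokens, full_matrices=False)
    return vt[:3]  # shape (3, D)

def tokens_to_rgb(tokens, basis, grid_side):
    """Project (N, D) tokens onto the shared basis and rescale to [0, 1]."""
    proj = tokens @ basis.T  # shape (N, 3)
    lo = proj.min(axis=0)
    hi = proj.max(axis=0)
    # ASSUMPTION: per-channel min-max scaling; the caption does not specify it.
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-8)
    return rgb.reshape(grid_side, grid_side, 3)
```

The basis is fitted once on the high-resolution target and then reused for every feature map in the same row, for example `tokens_to_rgb(low_res_tokens, basis, 16)` for a 256-token map. Sharing the basis keeps colors comparable across columns.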