Another BRIXEL in the Wall: Towards Cheaper Dense Features
Alexander Lappe, Martin A. Giese
TL;DR
BRIXEL tackles the challenge of producing high-resolution dense feature maps from vision transformers without prohibitive compute. It does so through a simple self-distillation setup in which a frozen high-resolution teacher guides a downsampled student via a trainable refiner and a convolutional head. Training minimizes a weighted combination of $L_1$, edge, and spectral losses so that the student's output matches the teacher's dense features: $\mathcal{L}_{\text{total}}(\boldsymbol{\theta}) = \mathcal{L}_1(\boldsymbol{\theta}) + \lambda_{\text{edge}} \mathcal{L}_{\text{edge}}(\boldsymbol{\theta}) + \lambda_{\text{spectral}} \mathcal{L}_{\text{spectral}}(\boldsymbol{\theta})$. Across multiple DINOv3 backbones and 110k high-resolution training images, BRIXEL consistently improves performance on both scene- and object-centric dense tasks at fixed resolution while dramatically reducing compute relative to full high-resolution inference. The method also extends to higher-density regimes and to alternative backbones such as SigLIP 2, suggesting broad applicability. By enabling high-quality dense features with affordable resources, BRIXEL informs future dense-vision pretraining and efficient deployment of vision foundation models.
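To make the structure of the combined objective concrete, the following is a minimal sketch in plain Python on a single-channel feature map. Only the weighted sum of the three terms is taken from the text above. The specific forms used here are assumptions chosen for illustration: the edge term is taken to be an $L_1$ distance between forward-difference gradients, and the spectral term an $L_1$ distance between 2D DFT magnitudes. The function names and the default weights are hypothetical, and a practical implementation would use a tensor library rather than nested lists.

```python
import cmath
import math


def l1(a, b):
    """Mean absolute difference between two equally sized, non-empty 2D grids."""
    n = len(a) * len(a[0])
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def gradients(f):
    """Forward differences along x and y (ASSUMED edge operator).

    Requires a grid of at least 2 x 2 entries.
    """
    h, w = len(f), len(f[0])
    gx = [[f[i][j + 1] - f[i][j] for j in range(w - 1)] for i in range(h)]
    gy = [[f[i + 1][j] - f[i][j] for j in range(w)] for i in range(h - 1)]
    return gx, gy


def dft2_magnitude(f):
    """Magnitudes of the naive 2D discrete Fourier transform of a grid."""
    h, w = len(f), len(f[0])
    out = []
    for u in range(h):
        row = []
        for v in range(w):
            s = sum(
                f[i][j] * cmath.exp(-2j * math.pi * (u * i / h + v * j / w))
                for i in range(h)
                for j in range(w)
            )
            row.append(abs(s))
        out.append(row)
    return out


def total_loss(student, teacher, lambda_edge=1.0, lambda_spectral=1.0):
    """L_total = L_1 + lambda_edge * L_edge + lambda_spectral * L_spectral.

    The weights default to 1.0 purely for illustration (ASSUMPTION).
    """
    loss_l1 = l1(student, teacher)

    # ASSUMED edge term: L1 distance between spatial gradients.
    sgx, sgy = gradients(student)
    tgx, tgy = gradients(teacher)
    loss_edge = l1(sgx, tgx) + l1(sgy, tgy)

    # ASSUMED spectral term: L1 distance between DFT magnitudes.
    loss_spectral = l1(dft2_magnitude(student), dft2_magnitude(teacher))

    return loss_l1 + lambda_edge * loss_edge + lambda_spectral * loss_spectral
```

Under these assumed definitions, a constant offset between student and teacher is penalized by the $L_1$ and spectral terms but not by the edge term, whereas a misplaced boundary is penalized by all three.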
Abstract
Vision foundation models achieve strong performance on both global and locally dense downstream tasks. Pretrained on large images, the recent DINOv3 model family is able to produce very fine-grained dense feature maps, enabling state-of-the-art performance. However, computing these feature maps requires the input image to be available at very high resolution, as well as large amounts of compute due to the quadratic complexity of transformer self-attention in the number of tokens. To address these issues, we propose BRIXEL, a simple knowledge distillation approach that has the student learn to reproduce its own feature maps at higher resolution. Despite its simplicity, BRIXEL outperforms the baseline DINOv3 models by large margins on downstream tasks when the resolution is kept fixed. Moreover, it is able to produce feature maps that are very similar to those of the teacher at a fraction of the computational cost. Code and model weights are available at https://github.com/alexanderlappe/BRIXEL.
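The compute argument can be illustrated with a short back-of-the-envelope calculation. The sketch below counts patch tokens and pairwise attention interactions for a square image. The patch size of 16 and the two resolutions are assumptions chosen for illustration, the function names are hypothetical, and the estimate ignores class or register tokens as well as the parts of the network whose cost grows only linearly with the number of tokens.

```python
def num_tokens(height, width, patch_size=16):
    """Number of patch tokens for an image whose sides are multiples of patch_size.

    patch_size=16 is an ASSUMPTION made for this example.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


def attention_pair_count(height, width, patch_size=16):
    """Token-pair interactions in one self-attention layer: tokens squared."""
    n = num_tokens(height, width, patch_size)
    return n * n


def relative_attention_cost(low_res, high_res, patch_size=16):
    """Ratio of attention pair counts between two square input resolutions."""
    high = attention_pair_count(high_res, high_res, patch_size)
    low = attention_pair_count(low_res, low_res, patch_size)
    return high / low


# Illustrative resolutions (ASSUMED, not taken from the paper).
print(num_tokens(256, 256))                # 256 tokens
print(num_tokens(1024, 1024))              # 4096 tokens
print(relative_attention_cost(256, 1024))  # 256.0
```

Under these assumptions, increasing the side length by a factor of 4 multiplies the token count by 16 and the attention pair count by 256, which is why obtaining dense features from a low-resolution input is attractive.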
