LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, Dan Zhang

TL;DR

LoftUp addresses the limited spatial resolution of vision foundation models (VFMs) with a coordinate-based cross-attention transformer that upsamples their features to full image resolution. Training is task-agnostic and has two stages: Stage 1 uses class-agnostic masks to refine bicubic-upsampled low-resolution features into sharp pseudo-groundtruth (pseudo-GT), and Stage 2 applies self-distillation from a teacher network to further sharpen the high-resolution outputs. The approach reports consistent gains on semantic segmentation, depth estimation, and video object segmentation, supports arbitrary upsampling scales, and remains efficient. The code is released, so the upsampler can be used as a plug-and-play add-on for VFMs.

Abstract

Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at https://github.com/andrehuang/loftup.
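To make the architectural idea concrete, here is a minimal sketch of coordinate-based cross-attention, not the released LoftUp implementation. It assumes a single attention head, queries built from each pixel's RGB value concatenated with its normalized coordinates, keys projected from the low-resolution features, and the raw low-resolution features as values. The function names (`coordinate_grid`, `upsample`, `random_matrix`) and the random weights standing in for learned projections are hypothetical.

```python
import math
import random


def coordinate_grid(height, width):
    """Pixel-center coordinates normalized to (0, 1), as a height x width grid."""
    return [[((r + 0.5) / height, (c + 0.5) / width) for c in range(width)]
            for r in range(height)]


def linear(vec, weight):
    """Multiply a vector by a weight matrix stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (assumption: a real model trains these)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def upsample(image, lowres, w_query, w_key):
    """Cross-attend from every high-res pixel to all low-res feature tokens.

    image:  H x W grid of (r, g, b) tuples
    lowres: flat list of low-res feature vectors, each of length C
    Returns an H x W grid of C-dimensional features.
    """
    height, width = len(image), len(image[0])
    coords = coordinate_grid(height, width)
    keys = [linear(token, w_key) for token in lowres]
    scale = 1.0 / math.sqrt(len(keys[0]))
    channels = len(lowres[0])
    output = []
    for r in range(height):
        row = []
        for c in range(width):
            # Query = projected [R, G, B, y, x] for this pixel.
            query = linear(list(image[r][c]) + list(coords[r][c]), w_query)
            scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
            attn = softmax(scores)
            # Output = attention-weighted average of the low-res features.
            row.append([sum(a * token[ch] for a, token in zip(attn, lowres))
                        for ch in range(channels)])
        output.append(row)
    return output


rng = random.Random(0)
C, D = 4, 8                                   # feature and attention dimensions
lowres = [[rng.random() for _ in range(C)] for _ in range(2 * 2)]
image = [[(rng.random(), rng.random(), rng.random()) for _ in range(6)]
         for _ in range(6)]
w_query = random_matrix(D, 5, rng)            # 3 RGB channels + 2 coordinates
w_key = random_matrix(D, C, rng)
highres = upsample(image, lowres, w_query, w_key)  # 6 x 6 grid of C-dim features
```

Because the queries are defined per pixel rather than per fixed upsampling factor, the same weights can be applied to an image of any size, which is one way to read the claim about flexible input and feature resolutions.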

Paper Structure

This paper contains 15 sections, 6 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: LoftUp improves significantly across various tasks over the VFM backbone (DINOv2-S \cite{oquab2024dinov}) and current SoTA feature upsampling performance (FeatUp \cite{fu2024featup} and LiFT \cite{suri2024lift}). See experiment details in \ref{sec:exp}.
  • Figure 2: Comparison of features from upsamplers. Backbone is DINOv2-S/14 \cite{oquab2024dinov}.
  • Figure 3: Architecture of LoftUp. Our coordinate-based network with cross-attention mechanism effectively integrates the fine-grained details from image RGB values and semantically-rich low-res features to produce high-resolution feature maps.
  • Figure 4: Our two-stage LoftUp training approach. Stage 1 trains an upsampler with class-agnostic masks to refine bicubic-upsampled features. Stage 2 employs self-distillation, initializing teacher and student upsamplers from Stage 1's pre-trained model. All VFM image inputs share the resolution ($H\times W$). For visual clarity, the VFM block is omitted from Stage 2’s teacher branch.
  • Figure 5: Visualization of different pseudo-GT. Both Mask-Bicubic and Self-Distilled are proposed by our work. We set $\alpha=0.8$ (in \ref{eq:mask-adjustment}) to balance sharp boundaries from masks and fine-grained details from high-res features.
  • ...and 8 more figures
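
The mask-based pseudo-GT mentioned in the Figure 4 and Figure 5 captions can be illustrated with a small sketch. The mask-adjustment equation itself is not reproduced in this excerpt, so the form used here is an assumption: each pixel's feature is blended with the mean feature of its mask region, with `alpha` weighting the region mean. The function name `mask_adjust` and the toy data are hypothetical.

```python
def mask_adjust(features, mask_ids, alpha=0.8):
    """Blend each pixel's feature with the mean feature of its mask region.

    features: flat list of per-pixel feature vectors (e.g. bicubic-upsampled)
    mask_ids: flat list of integer region labels, one per pixel
    alpha:    weight on the region mean (assumed role of alpha, not confirmed)
    """
    sums, counts = {}, {}
    for feat, region in zip(features, mask_ids):
        acc = sums.setdefault(region, [0.0] * len(feat))
        for ch, value in enumerate(feat):
            acc[ch] += value
        counts[region] = counts.get(region, 0) + 1
    means = {region: [s / counts[region] for s in acc]
             for region, acc in sums.items()}
    # High alpha pulls features toward the region mean (sharper boundaries);
    # the remaining (1 - alpha) keeps per-pixel detail.
    return [[alpha * m + (1.0 - alpha) * v for m, v in zip(means[region], feat)]
            for feat, region in zip(features, mask_ids)]


# Four pixels with one feature channel, split into two mask regions.
features = [[0.0], [1.0], [10.0], [12.0]]
mask_ids = [0, 0, 1, 1]
adjusted = mask_adjust(features, mask_ids, alpha=0.8)
```

Under this assumed form, features inside a region move toward each other while the gap between regions is preserved, which matches the caption's description of trading off sharp mask boundaries against fine-grained detail.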