JAFAR: Jack up Any Feature at Any Resolution

Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome

TL;DR

JAFAR is a lightweight, task-agnostic feature upsampler that upscales features from any foundation vision encoder to arbitrary output resolutions using a cross-attention mechanism. It uses asymmetric query and hybrid-key streams with spatial feature modulation to align high-resolution queries with semantically enriched keys, producing sharp, boundary-aligned feature maps without high-resolution supervision. A simple, annotation-free training objective with multi-resolution views lets a model trained at low resolution generalize to much higher output resolutions. Across semantic segmentation, depth estimation, CAM fidelity, zero-shot segmentation, and bird’s-eye-view tasks, JAFAR consistently outperforms prior upsampling methods while remaining computationally efficient, illustrating the potential of a unified, backbone-agnostic upsampling module. The main limitation is the cost of global attention when the key set is large, which suggests future work on localized or more scalable attention variants and on backbone independence at inference.

Abstract

Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io

Paper Structure

This paper contains 48 sections, 10 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: JAFAR upsamples features from any foundation vision encoder to any image resolution, using the input image as high-resolution guidance. It generates sharp, boundary-aligned feature maps and serves as a versatile drop-in module for a variety of downstream tasks, including semantic segmentation, open-vocabulary segmentation, depth estimation, CAM evaluation, and bird’s-eye-view segmentation—consistently enhancing performance.
  • Figure 2: Overview of JAFAR. To construct the upsampling kernel, queries and keys are derived from a shared image representation. Queries are downsampled to match the target output resolution, while keys are downsampled to align with the spatial resolution of the vision encoder’s features. Keys are then semantically enriched via SFT modulation to promote semantic alignment between queries and keys. The resulting kernel is then used to interpolate features from the foundation vision encoder.
  • Figure 3: PCA Feature Visualization. DINOv2 ViT-S/14 features at $32 \times 32$ resolution from the ImageNet validation set are upsampled to $448 \times 448$. Baseline methods—whether training-free, task-dependent, or task-agnostic—introduce varying levels of blurriness and artifacts. Besides being task-agnostic, JAFAR produces sharp, content-aware feature maps with fewer artifacts.
  • Figure 4: Visual Comparison of Upsampler Outputs in Downstream Tasks. JAFAR-upsampled features produce sharper outputs that align more accurately with object boundaries across downstream tasks, namely class activation maps, semantic segmentation, and depth estimation.
  • Figure 5: PCA Feature Visualization. DINOv2 ViT-S/14 features at $32 \times 32$ resolution from the ImageNet validation set are upsampled to $448 \times 448$.
  • ...and 4 more figures
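
The pipeline described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the average pooling used for downsampling, the linear maps `w_gamma` and `w_beta`, the scaled dot-product softmax, and all function names are assumptions chosen for illustration. The only elements taken from the caption are the overall flow: queries and keys come from a shared image representation, queries are reduced to the output resolution, keys are reduced to the encoder's resolution and modulated by the encoder features (here as a feature-wise scale and shift, in the spirit of SFT), and the resulting kernel interpolates the encoder features.

```python
import numpy as np

def avg_pool(x, out_h, out_w):
    """Average-pool an (H, W, C) array to (out_h, out_w, C).
    Assumes H and W are integer multiples of out_h and out_w."""
    h, w, c = x.shape
    fh, fw = h // out_h, w // out_w
    return x.reshape(out_h, fh, out_w, fw, c).mean(axis=(1, 3))

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def upsample_features(image_repr, lr_feats, out_hw, w_gamma, w_beta):
    """Hypothetical attention-based upsampler (illustrative only).

    image_repr : (H, W, D) shared image representation
    lr_feats   : (h, w, C) low-resolution encoder features
    out_hw     : (H_out, W_out) target resolution
    w_gamma, w_beta : (C, D) assumed linear maps giving the scale and
                      shift used to modulate the keys
    """
    h, w, c = lr_feats.shape
    d = image_repr.shape[-1]
    out_h, out_w = out_hw

    # Queries: image representation reduced to the target output resolution.
    queries = avg_pool(image_repr, out_h, out_w).reshape(-1, d)
    # Keys: same representation reduced to the encoder's spatial resolution.
    keys = avg_pool(image_repr, h, w).reshape(-1, d)
    # Values: the encoder features themselves.
    values = lr_feats.reshape(-1, c)

    # SFT-style modulation (assumed form): per-location scale and shift
    # predicted from the encoder features enrich the keys semantically.
    gamma = values @ w_gamma
    beta = values @ w_beta
    keys = gamma * keys + beta

    # Upsampling kernel: each output location attends over all key locations.
    kernel = softmax(queries @ keys.T / np.sqrt(d), axis=-1)

    # Each output feature is a convex combination of encoder features.
    out = (kernel @ values).reshape(out_h, out_w, c)
    return out, kernel

# Toy usage: 4x4 encoder features upsampled to 8x8 with a 16x16 guidance map.
rng = np.random.default_rng(0)
image_repr = rng.normal(size=(16, 16, 8))
lr_feats = rng.normal(size=(4, 4, 5))
w_gamma = rng.normal(size=(5, 8))
w_beta = rng.normal(size=(5, 8))
out, kernel = upsample_features(image_repr, lr_feats, (8, 8), w_gamma, w_beta)
print(out.shape)  # (8, 8, 5)
```

Because the values are the unmodified encoder features and each kernel row sums to one, the output in this sketch stays within the range of the input features; only the queries and keys see the image. The output resolution is set solely by how the queries are downsampled, which is what allows an arbitrary target size. In the setting of Figures 3 and 5, going from $32 \times 32$ features to $448 \times 448$ corresponds to a $14\times$ ratio per side.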