JAFAR: Jack up Any Feature at Any Resolution

Paul Couairon; Loick Chambon; Louis Serrano; Jean-Emmanuel Haugeard; Matthieu Cord; Nicolas Thome

TL;DR

JAFAR is a lightweight, task-agnostic feature upsampler that upscales features from any foundation vision encoder to an arbitrary output resolution using cross-attention. It pairs asymmetric query and hybrid-key streams with spatial feature modulation to align high-resolution queries with semantically enriched keys, producing sharp, boundary-aligned feature maps without high-resolution supervision. A simple, annotation-free training objective based on multi-resolution views lets a model trained at low resolution generalize to much higher output resolutions. Across semantic segmentation, depth estimation, CAM fidelity, zero-shot segmentation, and bird's-eye-view tasks, JAFAR consistently outperforms prior upsampling methods while remaining computationally efficient, which points toward a unified, backbone-agnostic upsampling module. Its main limitation is the cost of global attention over large key sets; future work could explore localized or more scalable attention variants and backbone independence at inference.

Abstract

Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io
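The attention-based upsampling described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`softmax`, `sft_modulate`, `attention_upsample`) are hypothetical, features are plain lists rather than tensors, and the sketch assumes that the attention values are the unmodified low-resolution encoder features and that the SFT-style modulation is a per-channel scale and shift whose parameters are supplied by the caller. Each high-resolution query attends over all low-resolution keys, and the output at that position is the attention-weighted average of the low-resolution features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sft_modulate(key, gamma, beta):
    """SFT-style modulation (assumed form): per-channel scale and shift.

    In this sketch gamma and beta are given directly; how they are
    predicted is outside the scope of the example.
    """
    return [g * k + b for k, g, b in zip(key, gamma, beta)]


def attention_upsample(queries, keys, values):
    """Cross-attention upsampling sketch.

    queries: one d-dimensional vector per high-resolution position
    keys:    one d-dimensional vector per low-resolution position
    values:  one c-dimensional feature per low-resolution position
    Returns one c-dimensional feature per high-resolution position.
    """
    dim = len(keys[0])
    channels = len(values[0])
    upsampled = []
    for q in queries:
        # Scaled dot-product similarity between this query and every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the low-resolution features.
        upsampled.append(
            [sum(w * v[c] for w, v in zip(weights, values))
             for c in range(channels)]
        )
    return upsampled
```

Because the attention weights are non-negative and sum to one, every output feature in this sketch is a convex combination of the low-resolution features. The number of queries is independent of the number of keys, so the same function can produce an output grid of any size.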
