NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering

Loick Chambon, Paul Couairon, Eloi Zablocki, Alexandre Boulch, Nicolas Thome, Matthieu Cord

TL;DR

NAF upsamples Vision Foundation Model (VFM) features without retraining for each VFM. It introduces Cross-Scale Neighborhood Attention with RoPE: high-resolution guidance is derived from the input image, and the low-resolution VFM features are reweighted by a localized, frequency-domain kernel learned from data, which amounts to a data-adaptive inverse filtering akin to an IDFT. The method achieves state-of-the-art results across semantic segmentation, depth estimation, and downstream transfer tasks, while remaining efficient enough to scale to 2K feature maps and to apply zero-shot to 7B-parameter models. Beyond upsampling, NAF also performs well on image restoration, which suggests broader use as a general image-filtering primitive. The result is a practical, interpretable module that is trained once and then applied across VFMs.
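To make the "akin to an IDFT" remark concrete, here is a minimal NumPy sketch of the standard RoPE identity for one spatial axis. It is an illustration under stated assumptions, not the authors' implementation: the function names `rope_rotate` and `rope_score_closed_form`, the 1D positions, and the `10000`-based frequency schedule are all hypothetical choices made for this example. The sketch shows that the attention score between a rotated query and a rotated key depends only on their relative offset, and decomposes per channel pair into a dot-product term weighted by a cosine and a cross-product term weighted by a sine.

```python
import numpy as np

def rope_rotate(x, pos, freqs):
    """Rotate consecutive channel pairs of x by the angles pos * freqs (1D RoPE)."""
    pairs = x.reshape(-1, 2)                      # (C/2, 2) channel pairs
    ang = pos * freqs                             # one angle per channel pair
    cos, sin = np.cos(ang), np.sin(ang)
    rotated = np.stack(
        [pairs[:, 0] * cos - pairs[:, 1] * sin,
         pairs[:, 0] * sin + pairs[:, 1] * cos],
        axis=-1,
    )
    return rotated.reshape(-1)

def rope_score_closed_form(q, k, delta, freqs):
    """Score as a sum over channel pairs, with delta = pos_q - pos_k."""
    qp, kp = q.reshape(-1, 2), k.reshape(-1, 2)
    dot = (qp * kp).sum(axis=-1)                          # per-pair dot product
    cross = qp[:, 0] * kp[:, 1] - qp[:, 1] * kp[:, 0]     # per-pair 2D cross product
    ang = delta * freqs                                   # relative phase per pair
    return float((dot * np.cos(ang) + cross * np.sin(ang)).sum())

# Usage: the explicit rotation and the closed form agree.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
freqs = 10000.0 ** (-np.arange(4) / 4)            # assumed frequency schedule
explicit = float(np.dot(rope_rotate(q, 5.0, freqs), rope_rotate(k, 2.0, freqs)))
closed = rope_score_closed_form(q, k, 3.0, freqs)
print(np.isclose(explicit, closed))
```

Read this way, the score is a sum of cosine and sine terms over a set of frequencies, with content-dependent coefficients, which is the sense in which the kernel resembles an inverse Fourier sum. How the paper extends this to two spatial axes and across scales is not specified in this summary.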

Abstract

Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility. Code and checkpoints are available at https://github.com/valeoai/NAF.
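The mechanism described in the abstract can be sketched in a few lines of NumPy. This is a simplified illustration, not the released code: the function name `neighborhood_attention_upsample`, the tensor layouts, the 3x3 default window, and the border handling are assumptions made for the example, and RoPE and the learned image encoder are left out. Each high-resolution query, derived from image guidance, attends to a small window of low-resolution keys, and the resulting softmax weights average the VFM features in that window.

```python
import numpy as np

def neighborhood_attention_upsample(q_hr, k_lr, v_lr, kernel=3):
    """
    q_hr: (H, W, D) queries from high-resolution image guidance (assumed given)
    k_lr: (h, w, D) keys from the same guidance at the VFM's low resolution
    v_lr: (h, w, C) low-resolution VFM features used as values
    Returns upsampled features of shape (H, W, C).
    """
    H, W, D = q_hr.shape
    h, w, C = v_lr.shape
    r = kernel // 2
    out = np.zeros((H, W, C))
    for y in range(H):
        cy = y * h // H                                   # low-res row under this pixel
        ys = np.clip(np.arange(cy - r, cy + r + 1), 0, h - 1)
        for x in range(W):
            cx = x * w // W                               # low-res column under this pixel
            xs = np.clip(np.arange(cx - r, cx + r + 1), 0, w - 1)
            # Gather the kernel x kernel neighborhood; clipping repeats border cells.
            keys = k_lr[np.ix_(ys, xs)].reshape(-1, D)
            vals = v_lr[np.ix_(ys, xs)].reshape(-1, C)
            logits = keys @ q_hr[y, x] / np.sqrt(D)
            weights = np.exp(logits - logits.max())       # numerically stable softmax
            weights /= weights.sum()
            out[y, x] = weights @ vals                    # convex combination of features
    return out

# Usage: upsample a 4x4 feature map with 16 channels to 8x8.
rng = np.random.default_rng(0)
q_hr = rng.normal(size=(8, 8, 6))
k_lr = rng.normal(size=(4, 4, 6))
v_lr = rng.normal(size=(4, 4, 16))
up = neighborhood_attention_upsample(q_hr, k_lr, v_lr)
print(up.shape)
```

Because the weights are computed only from image guidance and never from the VFM features themselves, the same filter can in principle be applied to features of any channel count, which is consistent with the zero-shot claim above. The nested Python loops are for readability only; an efficient version would batch the window gathering.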

Paper Structure

This paper contains 56 sections, 26 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Neighborhood Attention Filtering (NAF) as a Zero-Shot Feature Upsampler: train once, apply efficiently to any Vision Foundation Model (including 7B models) to any scale, achieving state-of-the-art results across multiple downstream tasks.
  • Figure 2: The NAF architecture upsamples low-resolution VFM features to any resolution, guided solely by the original high-resolution image.
  • Figure 3: Details of the dual-branch image encoder. The NAF encoder combines a pixel-wise branch and a local-contextual branch.
  • Figure 4: Illustration of the mean and channel-specific cosine and sine of $\Delta \Phi_c$. We compute the mean across all channels and select a single random channel to illustrate its individual behavior. For the cosine, we observe an overall decreasing pattern as the distance from the center increases.
  • Figure 5: Dot and cross products for a specific channel given a query point p on an image. The neighborhood around p is highlighted with a dashed red square. On the feature side, after VFM downsampling, NAF implicitly discriminates boundaries.
  • ...and 8 more figures