FeatUp: A Model-Agnostic Framework for Features at Any Resolution

Stephanie Fu, Mark Hamilton, Laura Brandt, Axel Feldman, Zhoutong Zhang, William T. Freeman

TL;DR

FeatUp addresses the challenge of combining high semantic fidelity with fine spatial resolution in deep vision features. It introduces two model-agnostic upsampling pathways, a fast Joint Bilateral Upsampler (JBU) and a per-image implicit network, both trained under a multi-view consistency objective inspired by NeRF and formalized with a high-resolution target $F_{hr}$ and a reconstruction loss ...

Abstract

Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.
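The multi-view consistency idea can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: `toy_backbone` is a hypothetical stand-in for a deep model (it only average-pools), the "slight transformations" are circular pixel shifts, and the downsampler is a fixed average pool rather than a learned one. The loss asks that candidate high-resolution features, once transformed and downsampled, match the low-resolution features the backbone produces on the transformed image.

```python
import numpy as np

def avg_pool(a, s):
    """Average-pool an (H, W, C) array over non-overlapping s x s blocks."""
    h, w, c = a.shape
    return a.reshape(h // s, s, w // s, s, c).mean(axis=(1, 3))

def toy_backbone(image, stride=4):
    # Hypothetical stand-in for a deep model: it only pools, so its ideal
    # high-res "features" are the image itself.
    return avg_pool(image, stride)

def shift(a, dy, dx):
    # The "slight transformation" assumed here is a circular pixel shift.
    return np.roll(a, shift=(dy, dx), axis=(0, 1))

def multiview_loss(image, f_hr, shifts, stride=4):
    """Mean squared error between low-res views and downsampled high-res features."""
    total = 0.0
    for dy, dx in shifts:
        low_res_view = toy_backbone(shift(image, dy, dx), stride)
        # Fixed average-pool downsampler (an assumption; not learned here).
        predicted = avg_pool(shift(f_hr, dy, dx), stride)
        total += np.mean((low_res_view - predicted) ** 2)
    return total / len(shifts)

rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))
shifts = [(0, 0), (1, 0), (0, 2), (3, 1)]

# Candidate 1: nearest-neighbour upsampling of the low-res features.
naive = np.repeat(np.repeat(toy_backbone(image), 4, axis=0), 4, axis=1)
# Candidate 2: the ideal high-res features for this toy backbone.
ideal = image.copy()

loss_naive = multiview_loss(image, naive, shifts)
loss_ideal = multiview_loss(image, ideal, shifts)
```

In this toy setting the naive upsampling agrees with the unshifted view but disagrees with the shifted ones, so its loss is positive, while the ideal candidate is consistent with every view. Observing the same scene through several low-resolution views is what constrains the high-resolution signal.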

Paper Structure (38 sections, 8 equations, 22 figures, 7 tables)

Figures (22)

  • Figure 1: FeatUp upsamples image features from any model backbone, adding spatial resolution to existing semantics. High-res features can be learned either as a per-image implicit network or a general-purpose upsampling operation; the latter is a drop-in module to improve downstream dense prediction tasks.
  • Figure 2: The FeatUp training architecture. FeatUp learns to upsample features through a consistency loss on low resolution "views" of a model's features that arise from slight transformations of the input image.
  • Figure 3: We introduce two learned downsamplers. The simple downsampler (left) is a fast learned blur kernel. The attention downsampler (right) combines a predicted salience map with spatially invariant kernels and can better adapt to networks with nonlinear and dynamic receptive fields.
  • Figure 4: Our implicit version of FeatUp learns an implicit network to upsample a single image's features. Our JBU version of FeatUp learns a stack of JBUs from a large image corpus to upsample features quickly.
  • Figure 5: Low-res ViT features $(14\times14)$ from the COCO-Stuff validation set are upsampled by $16\times$. Bilinear and resize-conv baselines produce blurry outputs. Larger inputs and smaller transformer strides can help, but introduce noise or blur and are bound by time and memory constraints (we can only compute $8\times$ upsampling for these methods; see Figure \ref{fig:memory_time}). Our FeatUp methods preserve the semantics of the low-res features and recover lost spatial information from the high-res input image.
  • ...and 17 more figures
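
The captions above mention a stack of JBUs. For orientation, here is a minimal sketch of classical (non-learned) joint bilateral upsampling, assuming Gaussian spatial and range kernels and integer upsampling factors. The function name, parameters, and toy data are illustrative assumptions; FeatUp's learned JBU stack is not reproduced here. Each high-res pixel averages nearby low-res features, weighted by spatial distance in the low-res grid and by similarity of the high-res guidance image.

```python
import numpy as np

def joint_bilateral_upsample(feats_lr, guide_hr, radius=1,
                             sigma_spatial=1.0, sigma_range=0.1):
    """Upsample (h, w, C) features onto the grid of an (H, W, G) guidance image."""
    h, w, c = feats_lr.shape
    H, W, _ = guide_hr.shape
    sy, sx = H // h, W // w                  # integer upsampling factors (assumed)
    out = np.zeros((H, W, c))
    for y in range(H):
        for x in range(W):
            cy, cx = y // sy, x // sx        # low-res cell containing this pixel
            acc, norm = np.zeros(c), 0.0
            for qy in range(max(cy - radius, 0), min(cy + radius + 1, h)):
                for qx in range(max(cx - radius, 0), min(cx + radius + 1, w)):
                    # Guidance value sampled at the centre of low-res cell (qy, qx).
                    gq = guide_hr[qy * sy + sy // 2, qx * sx + sx // 2]
                    d_spatial = (qy - cy) ** 2 + (qx - cx) ** 2
                    d_range = np.sum((guide_hr[y, x] - gq) ** 2)
                    weight = np.exp(-d_spatial / (2 * sigma_spatial ** 2)
                                    - d_range / (2 * sigma_range ** 2))
                    acc += weight * feats_lr[qy, qx]
                    norm += weight
            out[y, x] = acc / norm
    return out

# Toy guidance image with a vertical edge at column 3, which does not align
# with the boundary between low-res cells (columns 0-3 and 4-7).
guide = np.zeros((8, 8, 1))
guide[:, 3:] = 1.0

# 2x2 low-res features: left cells are class 0, right cells are class 1.
feats_lr = np.zeros((2, 2, 2))
feats_lr[:, 0, 0] = 1.0
feats_lr[:, 1, 1] = 1.0

upsampled = joint_bilateral_upsample(feats_lr, guide)
nearest = np.repeat(np.repeat(feats_lr, 4, axis=0), 4, axis=1)
```

In this toy example, nearest-neighbour upsampling assigns column 3 to the left class because it lies in the left low-res cell, whereas the guided result follows the edge in the guidance image and assigns it to the right class.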