Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

Minseok Seo, Mark Hamilton, Changick Kim

TL;DR

Upsample Anything presents a training-free, test-time optimization framework that learns per-pixel anisotropic Gaussian kernels to upsample low-resolution foundation features into high-resolution, edge-aware outputs across 2D and 3D signals. By recasting Joint Bilateral Upsampling within Gaussian Splatting and optimizing kernels per image, the method delivers a universal, model-agnostic upsampling operator with fast runtime ($\approx 0.419\,\text{s}$ per $224\times224$ image) and robust performance on semantic segmentation, depth estimation, and depth-map upsampling. It eliminates the need for dataset-level retraining, enabling seamless transfer across backbones and tasks, and extends naturally to 3D feature volumes guided by RGB-D signals. Extensive experiments demonstrate state-of-the-art or near-SOTA results across multiple benchmarks, validating its practical impact for scalable, high-fidelity pixel- and voxel-level reconstruction.

Abstract

We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by $14\times$ or $16\times$ (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only $\approx 0.419\,\text{s}$ per $224\times224$ image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling. Project page: https://seominseok0429.github.io/Upsample-Anything/

Paper Structure

This paper contains 43 sections, 4 theorems, 18 equations, 12 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Fix $\sigma_s{>}0,\ \sigma_r{>}0$ and let $\Lambda=\Lambda(\sigma_s,\sigma_r)$. Then, for any $p\in\Omega$, JBU coincides with evaluating a normalized Gaussian mixture in the lifted space $\mathbb{R}^{2+d}$ whose centers are $\{\phi(q)\}_{q\in\Omega(p)}$ and whose (isotropic-by-block) covariance is $\Lambda$.
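
For orientation, the identity behind this statement can be sketched in the standard Joint Bilateral Upsampling notation. The symbols $F_{lr}$ (low-resolution features), $I$ (guidance image with $d$ channels), the lift $\phi(q)=(q, I(q))$, and the explicit block form of $\Lambda$ are assumptions made for this sketch; the paper's own statement may use different notation.

```latex
% Standard JBU with a spatial and a range Gaussian (assumed notation):
\hat F(p) \;=\; \frac{\sum_{q\in\Omega(p)} F_{lr}(q)\,
      \exp\!\Big(-\tfrac{\|p-q\|^2}{2\sigma_s^2}\Big)
      \exp\!\Big(-\tfrac{\|I(p)-I(q)\|^2}{2\sigma_r^2}\Big)}
     {\sum_{q\in\Omega(p)}
      \exp\!\Big(-\tfrac{\|p-q\|^2}{2\sigma_s^2}\Big)
      \exp\!\Big(-\tfrac{\|I(p)-I(q)\|^2}{2\sigma_r^2}\Big)}

% Lift each pixel to the joint domain and use a block-isotropic covariance:
\phi(q) = \big(q,\ I(q)\big) \in \mathbb{R}^{2+d},
\qquad
\Lambda = \begin{pmatrix} \sigma_s^2 I_2 & 0 \\ 0 & \sigma_r^2 I_d \end{pmatrix}

% The product of the two Gaussians is then a single Gaussian in the lifted space:
\exp\!\Big(-\tfrac{\|p-q\|^2}{2\sigma_s^2}\Big)
\exp\!\Big(-\tfrac{\|I(p)-I(q)\|^2}{2\sigma_r^2}\Big)
\;=\;
\exp\!\Big(-\tfrac12\,\big(\phi(p)-\phi(q)\big)^{\!\top}\Lambda^{-1}\big(\phi(p)-\phi(q)\big)\Big)
```

Under these assumptions, the JBU output at $p$ is a weighted average of $F_{lr}(q)$ whose weights are Gaussian densities centered at $\phi(q)$ with covariance $\Lambda$, normalized to sum to one.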

Figures (12)

  • Figure 1: Our method performs lightweight test-time optimization ($\approx$0.419 s/image) without requiring any dataset-level training. It generalizes seamlessly across domains while maintaining consistent reconstruction quality for every image. (All examples are randomly selected, without cherry-picking.)
  • Figure 2: Comparison of dataset-level training and our test-time optimization (TTO). (a) Dataset-level methods (FeatUp, LoftUp, JAFAR, AnyUp) require paired training data and handle only 2D feature maps. (b) Our Upsample Anything performs TTO using only one HR image and generalizes to feature, depth, segmentation, and even 3D features.
  • Figure 3: Overview of Upsample Anything. Given a high-resolution image $I_{hr}$, we downsample it to $I_{lr}$ and optimize GSJBU to reconstruct $I_{hr}$, learning per-pixel anisotropic kernels $\{\sigma_x, \sigma_y, \theta, \sigma_r\}$ via test-time optimization (TTO). The learned kernels are then applied to foundation features $F_{lr}$ for rendering the high-resolution features $F_{hr}$, achieving pixel-wise anisotropic joint bilateral upsampling.
  • Figure 4: Depth upsampling results on Middlebury (top) and NYUv2 (bottom). 32×32 low-resolution depth maps were upsampled to high resolution using different methods. Although Upsample Anything produces sharper and more detailed edges, it still yields a higher RMSE (0.237) than bilinear (0.159) on low-resolution maps. However, in high-resolution depth prediction, Upsample Anything outperforms both qualitatively and quantitatively.
  • Figure 5: Comparison across different resolutions. Qualitative results of AnyUp (previous SOTA) and our Upsample Anything on varying input resolutions.
  • ...and 7 more figures
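
The rendering step described in the Figure 3 caption (anisotropic kernels with parameters $\{\sigma_x, \sigma_y, \theta, \sigma_r\}$ applied to low-resolution features) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `aniso_weight` and `jbu_upsample` are hypothetical, the kernel parameters are shared by all pixels rather than learned per pixel, the feature map has a single channel, and the test-time optimization loop that fits the kernels is omitted.

```python
import math

def aniso_weight(dx, dy, d_guide_sq, sx, sy, theta, sr):
    """Unnormalized weight: rotated anisotropic spatial Gaussian times a range Gaussian."""
    c, s = math.cos(theta), math.sin(theta)
    u = c * dx + s * dy       # offset along the kernel's first axis
    v = -s * dx + c * dy      # offset along the kernel's second axis
    spatial = (u / sx) ** 2 + (v / sy) ** 2
    return math.exp(-0.5 * spatial - 0.5 * d_guide_sq / sr ** 2)

def jbu_upsample(feat_lr, guide_lr, guide_hr, scale, params, radius=1):
    """Upsample a single-channel h x w map by an integer factor `scale`.

    feat_lr, guide_lr: h x w nested lists; guide_hr: (h*scale) x (w*scale).
    params: (sx, sy, theta, sr), shared by all pixels (an assumption of
    this sketch; the paper describes per-pixel kernels).
    """
    h, w = len(feat_lr), len(feat_lr[0])
    H, W = h * scale, w * scale
    sx, sy, theta, sr = params
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            # Position of the HR pixel center in LR pixel coordinates.
            py = (y + 0.5) / scale - 0.5
            px = (x + 0.5) / scale - 0.5
            cy, cx = round(py), round(px)
            num = den = 0.0
            for qy in range(max(0, cy - radius), min(h, cy + radius + 1)):
                for qx in range(max(0, cx - radius), min(w, cx + radius + 1)):
                    dg = guide_hr[y][x] - guide_lr[qy][qx]
                    wgt = aniso_weight(px - qx, py - qy, dg * dg,
                                       sx, sy, theta, sr)
                    num += wgt * feat_lr[qy][qx]
                    den += wgt
            out[y][x] = num / den  # normalized weighted average
    return out
```

With a small range bandwidth `sr`, low-resolution pixels whose guidance value differs from the target pixel contribute almost nothing, which is what keeps edges sharp. In the pipeline the caption describes, the kernel parameters would instead be fitted per image by reconstructing $I_{hr}$ from $I_{lr}$ and then reused on $F_{lr}$.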

Theorems & Definitions (7)

  • Theorem 1: JBU as a normalized Gaussian mixture in the joint domain
  • proof
  • Corollary 1: Discrete GS view in the joint domain
  • Theorem 2: Specialization of GSJBU to JBU (isotropic limit)
  • proof
  • Proposition 1: Discrete-to-continuous convergence
  • proof (sketch)