Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
Minseok Seo, Mark Hamilton, Changick Kim
TL;DR
Upsample Anything presents a training-free, test-time optimization framework that learns per-pixel anisotropic Gaussian kernels to upsample low-resolution foundation features into high-resolution, edge-aware outputs across 2D and 3D signals. By recasting Joint Bilateral Upsampling within Gaussian Splatting and optimizing kernels per image, the method delivers a universal, model-agnostic upsampling operator with fast runtime ($\approx 0.419\,\text{s}$ per $224\times224$ image) and robust performance on semantic segmentation, depth estimation, and depth-map upsampling. It eliminates the need for dataset-level retraining, enabling seamless transfer across backbones and tasks, and extends naturally to 3D feature volumes guided by RGB-D signals. Extensive experiments demonstrate state-of-the-art or near-state-of-the-art results across multiple benchmarks, validating its practical impact for scalable, high-fidelity pixel- and voxel-level reconstruction.
Abstract
We present \textbf{Upsample Anything}, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by $14\times$ or $16\times$ (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only $\approx 0.419\,\text{s}$ per $224\times224$ image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling. \textbf{Project page:} \href{https://seominseok0429.github.io/Upsample-Anything/}{https://seominseok0429.github.io/Upsample-Anything/}
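To make the idea of a kernel "combining spatial and range cues" concrete, the following is a minimal NumPy sketch of joint bilateral upsampling with an anisotropic spatial Gaussian. It is an illustration under stated assumptions, not the authors' implementation: the function name `anisotropic_jbu` and the parameters `sigma_x`, `sigma_y`, `theta`, `sigma_r`, and `radius` are hypothetical, and a single fixed kernel is shared by all pixels. The method described above instead learns the kernel per pixel through test-time optimization, a step this sketch does not attempt.

```python
import numpy as np

def anisotropic_jbu(feat_lr, guide_hr, sigma_x=1.0, sigma_y=1.0,
                    theta=0.0, sigma_r=0.1, radius=2):
    """Upsample feat_lr (h, w, C) to the resolution of guide_hr (H, W, G).

    Assumption: one fixed kernel for every pixel (illustrative only).
    """
    h, w, C = feat_lr.shape
    H, W, _ = guide_hr.shape
    sy, sx = H / h, W / w

    # Guide values sampled at the centre of each low-resolution cell.
    cy = np.clip(((np.arange(h) + 0.5) * sy).astype(int), 0, H - 1)
    cx = np.clip(((np.arange(w) + 0.5) * sx).astype(int), 0, W - 1)
    guide_lr = guide_hr[cy][:, cx]

    # Precision matrix (inverse covariance) of a rotated 2D Gaussian.
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    prec = rot @ np.diag([1.0 / sigma_x**2, 1.0 / sigma_y**2]) @ rot.T

    out = np.zeros((H, W, C))
    for y in range(H):
        for x in range(W):
            # High-resolution pixel centre expressed in low-resolution coordinates.
            py = (y + 0.5) / sy - 0.5
            px = (x + 0.5) / sx - 0.5
            qy, qx = int(round(py)), int(round(px))
            ys = np.arange(max(qy - radius, 0), min(qy + radius, h - 1) + 1)
            xs = np.arange(max(qx - radius, 0), min(qx + radius, w - 1) + 1)

            # Spatial term: Mahalanobis distance d^T P d with d = (dx, dy).
            dy = (ys - py)[:, None]
            dx = (xs - px)[None, :]
            spatial = (prec[0, 0] * dx**2 + 2.0 * prec[0, 1] * dx * dy
                       + prec[1, 1] * dy**2)

            # Range term: squared guide difference to the target pixel.
            diff = guide_lr[ys][:, xs] - guide_hr[y, x]
            rng_term = (diff**2).sum(-1) / sigma_r**2

            # Normalized weights give a convex combination of neighbours.
            wgt = np.exp(-0.5 * (spatial + rng_term))
            wgt /= wgt.sum()
            out[y, x] = np.tensordot(wgt, feat_lr[ys][:, xs],
                                     axes=([0, 1], [0, 1]))
    return out
```

Because the weights are normalized, each output feature is a convex combination of nearby low-resolution features. The range term suppresses neighbours whose guide colour differs from the target pixel, which is what keeps edges sharp. Setting `sigma_x` and `sigma_y` to different values with a nonzero `theta` elongates the kernel along one direction.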
