
MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

Bocheng Zou, Mu Cai, Mark Stanley, Dingfu Lu, Yong Jae Lee

Abstract

Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute: it is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families: primarily DINOv2, with successful generalization to contrastive models such as SigLIP2.
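
To make the fusion procedure concrete, below is a minimal PyTorch sketch of MuRF-style inference under stated assumptions: the side lengths (266, 518, 784) are borrowed from Figure 1, DINOv2 ViT-B/14 is loaded through its public torch.hub entry point, and bilinear upsampling with mean fusion stands in for whatever fusion operator the paper actually uses.

    import torch
    import torch.nn.functional as F

    # Frozen VFM: DINOv2 ViT-B/14 (patch size 14) from the public hub.
    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

    @torch.no_grad()
    def murf_features(image, side_lengths=(266, 518, 784)):
        """image: (B, 3, H, W) ImageNet-normalized tensor.
        Returns a fused (B, C, h, w) feature map on the finest grid."""
        maps = []
        for s in side_lengths:
            # Each side length is a multiple of the 14-pixel patch size.
            view = F.interpolate(image, size=(s, s), mode="bilinear",
                                 align_corners=False)
            # Last-block patch tokens reshaped to (B, C, s/14, s/14).
            feat = model.get_intermediate_layers(view, n=1, reshape=True)[0]
            maps.append(feat)
        # Upsample all maps to the finest grid, then fuse (mean is one
        # choice; channel-wise concatenation is a plausible alternative).
        target = maps[-1].shape[-2:]
        maps = [F.interpolate(m, size=target, mode="bilinear",
                              align_corners=False) for m in maps]
        return torch.stack(maps, dim=0).mean(dim=0)

The encoder itself is never fine-tuned; only the fused output feeds the lightweight task-specific heads described in Figure 2.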

Figures (6)

  • Figure 1: The "Recognition vs. Refinement" Dynamic. Feature maps obtained when the input is resized to side lengths of 266, 518, and 784 pixels. At lower resolutions, the representation is globally coherent, enabling robust recognition. At higher resolutions, boundary details are sharper, enabling precise refinement, but the object's interior becomes noisy, risking incomplete segmentation. Our work is motivated by synergizing these two roles.
  • Figure 2: Overview of Multi-Resolution Fusion (MuRF). An input image is resized to multiple resolutions and each view is processed by a frozen DINOv2 encoder to produce separate feature maps. These features are upsampled to a shared spatial resolution and fused into a single multi-resolution representation, which can then be used by lightweight task-specific heads for semantic segmentation, depth estimation, visual question answering, visual grounding, and other downstream tasks. Background is removed from the PCA figures.
  • Figure 3: Qualitative comparison of semantic segmentation results on ADE20K (top) and PASCAL VOC (bottom) with different input resolutions. All images are resized to a square shape before being fed into DINOv2, and the label above each image indicates the corresponding input resolution (side length in pixels).
  • Figure 4: Qualitative depth estimation results on NYUd (left) and SUN RGB-D (right). We compare single-scale DINOv2 predictions at 0.5×, 1.0×, and 1.5× input resolutions with our MuRF fusion. By aggregating multi-resolution features, MuRF better preserves global scene structure while sharpening local geometry, producing smoother and more accurate depth maps. A label of 0.X× indicates that the image fed into DINOv2 is resized to 0.X times the original height and width.
  • Figure 5: Visualization of anomaly detection on the MVTec AD 2 TESTpub dataset. Our merged result (MuRF) successfully combines the robust detection of low-resolution views (e.g., 0.3× correctly identifies the anomaly's presence, but with a coarse mask) and the sharp boundaries of high-resolution views (e.g., 0.7×).
  • ...and 1 more figure