SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation

Naomi Kombol, Ivan Martinović, Siniša Šegvić, Giorgos Tolias

Abstract

Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating its effectiveness for efficient high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
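
The feature-regression distillation at the core of SPAR can be pictured in a few lines. Below is a minimal PyTorch sketch of one such distillation step, assuming an L1 objective and bilinear resampling of the student map onto the teacher's (finer) grid; the function name, loss choice, and tensor layout are illustrative assumptions rather than the paper's verified implementation (see the linked repository for the actual code).

```python
# Minimal sketch of a feature-regression distillation step (assumptions:
# L1 loss, bilinear resampling); not the paper's verified implementation.
import torch
import torch.nn.functional as F

def distillation_loss(student_feats: torch.Tensor,
                      teacher_feats: torch.Tensor) -> torch.Tensor:
    """Regress single-pass student features onto stitched teacher features.

    student_feats: (B, C, Hs, Ws) dense features from one forward pass.
    teacher_feats: (B, C, Ht, Wt) stitched sliding-window features (frozen).
    """
    # Resample the student map to the teacher's spatial grid if they differ.
    if student_feats.shape[-2:] != teacher_feats.shape[-2:]:
        student_feats = F.interpolate(
            student_feats, size=teacher_feats.shape[-2:],
            mode="bilinear", align_corners=False)
    # Plain per-pixel regression; the teacher is frozen, so detach its output.
    return F.l1_loss(student_feats, teacher_feats.detach())
```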

Paper Structure

This paper contains 20 sections, 4 equations, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: Performance vs. inference time trade-off. Comparison between the pre-trained SigLIP2 -- ViT-B-16 with single-pass or sliding-window (stride value reported in text labels) inference, and our single-pass SPAR-distilled model. The teacher uses sliding windows of size $512\times512$ with stride 24. We report average performance across six datasets along with average inference time for a 1024$\times$2048 image. *ND: stride not divisible by patch size.
  • Figure 2: Overview of SPAR. During training, the teacher branch uses a frozen foundational vision encoder to generate feature maps via a sliding-window process followed by stitching. Stitching refers to merging the feature maps of overlapping windows into a unified representation aligned with the original image layout (a minimal sketch of this procedure appears after this list). The student branch, initialized from the same pre-trained weights, is trained to match the teacher's output using efficient single-pass inference. At inference time, the student model enables fast and accurate segmentation at diverse resolutions and aspect ratios using a single forward pass.
  • Figure 3: Performance (left) and inference time (right) vs. image resolution. We compare SPAR to the pre-trained single-pass and sliding-window models, and to single-pass NaFlex. SigLIP2 -- ViT-B-16 with native resolution $512\times512$ and $K=512$ is used. Results on Cityscapes with resolution corresponding to the denoted area. Resolution 512 does not qualify for $K=512$. Single-pass models, including SPAR, have the same inference time. $\dagger$: larger resolution used in training (short side up to 2560 pixels instead of 2048). ALL: the whole network is trained; otherwise only the last 2 blocks.
  • Figure 4: Performance vs. resolution for DINOv3.
  • Figure 5: Qualitative segmentation and PCA analysis results. We show that SPAR yields less noisy and spatially smoother predictions than the teacher, further improved by LPOSS [stojnic2025lposs]. Images are from VOC21 [voc], Context60 [context], ADE20K [ade20k], and Cityscapes [cityscapes].
  • ...and 2 more figures
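
For concreteness, here is a hedged sketch of the sliding-window feature extraction with stitching performed by the teacher branch in Figure 2. It assumes image sides, window, and stride are all multiples of the patch size and averages features where windows overlap; the paper's teacher actually uses stride 24, which is not divisible by the 16-pixel patch size and therefore requires sub-patch alignment that this sketch omits. The `encoder` interface and all names are hypothetical.

```python
# Hedged sketch of sliding-window extraction with stitching (Figure 2 teacher
# branch). Assumes patch-aligned window/stride; the paper's stride-24 teacher
# additionally needs sub-patch alignment, which is omitted here.
import torch

@torch.no_grad()
def stitched_teacher_features(encoder, image: torch.Tensor,
                              window: int = 512, stride: int = 256,
                              patch: int = 16) -> torch.Tensor:
    """image: (1, 3, H, W) with H, W >= window and divisible by `patch`;
    `encoder` maps a (1, 3, window, window) crop to a
    (1, C, window // patch, window // patch) feature map."""
    _, _, H, W = image.shape
    gh, gw = H // patch, W // patch          # feature grid of the full image
    feats = counts = None
    # Window origins; append the border position so the edges are covered.
    ys = list(range(0, H - window + 1, stride))
    xs = list(range(0, W - window + 1, stride))
    if ys[-1] != H - window:
        ys.append(H - window)
    if xs[-1] != W - window:
        xs.append(W - window)
    for y in ys:
        for x in xs:
            f = encoder(image[:, :, y:y + window, x:x + window])
            if feats is None:
                feats = f.new_zeros(1, f.shape[1], gh, gw)
                counts = f.new_zeros(1, 1, gh, gw)
            gy, gx = y // patch, x // patch  # patch-aligned grid offset
            feats[:, :, gy:gy + f.shape[2], gx:gx + f.shape[3]] += f
            counts[:, :, gy:gy + f.shape[2], gx:gx + f.shape[3]] += 1
    # Average the overlapping contributions ("stitching").
    return feats / counts
```

The averaging over overlaps is one plausible reading of "merging the feature maps of overlapping windows"; the released code at the repository linked above is authoritative.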