RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models

Omar Alama, Darshil Jariwala, Avigyan Bhattacharya, Seungchan Kim, Wenshan Wang, Sebastian Scherer

TL;DR

RADSeg introduces a dense, language-aligned OVSS pipeline built on the RADIO backbone. It combines Self-Correlating Recursive Attention (SCRA) and Self-Correlating Global Aggregation (SCGA) to enhance spatial locality and reduce artifacts. A lightweight SigLIP-based dense language alignment and a RADIO-SAM refinement path produce higher-quality masks with substantially lower compute and parameter budgets than prior large-scale, multi-model baselines. Across 2D and 3D benchmarks, RADSeg achieves state-of-the-art or competitive mIoU while running up to 3.95x faster and using 2.5x fewer parameters in the base ViT class, and RADSeg-base (105M) outperforms combinations of huge vision models (850-1350M). The work also provides the first comprehensive study of RADIO for zero-shot OVSS, underscoring its emergent dense language alignment and practical applicability for open-world perception tasks.

Abstract

Open-vocabulary semantic segmentation (OVSS) underpins many vision and robotics tasks that require generalizable semantic understanding. Existing approaches either rely on limited segmentation training data, which hinders generalization, or apply zero-shot heuristics to vision-language models (e.g., CLIP). The most competitive approaches combine multiple models to improve performance, at the cost of high computational and memory demands. In this work, we leverage an overlooked agglomerative vision foundation model, RADIO, to improve zero-shot OVSS along three key axes simultaneously: mIoU, latency, and parameter efficiency. We present the first comprehensive study of RADIO for zero-shot OVSS and enhance its performance through self-correlating recursive attention, self-correlating global aggregation, and computationally efficient mask refinement. Our approach, RADSeg, achieves 6-30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters. Surprisingly, RADSeg-base (105M) outperforms previous combinations of huge vision models (850-1350M) in mIoU, achieving state-of-the-art accuracy with substantially lower computational and memory cost.

Paper Structure

This paper contains 23 sections, 7 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Overview of the RADSeg pipeline. RGB sliding windows are processed by the RADIO backbone. Self-Correlating Recursive Attention (SCRA) computes a similarity matrix from these outputs, which is recursively fed back into the last attention block of RADIO. Feature windows are aggregated into a feature map and refined through Self-Correlating Global Aggregation (SCGA) to reduce noise and windowing artifacts. Features are language-aligned with the SigLIP CLS token adaptor, and predictions are made by comparing them with text embeddings. Optionally, masks can be further refined using RADIO-SAM, which adds only 20M parameters.
  • Figure 2: Qualitative comparison of last block attention and patch-wise similarity at different parts of the RADIO framework. The output of the RADIO backbone (Block $l$) can consistently attend to semantically similar patches, motivating our SCRA approach.
  • Figure 3: Qualitative 2D Open-Vocabulary Semantic Segmentation Results. For each of the five benchmark datasets, we show a representative example and compare RADSeg and RADSeg+ with competitive baselines (SC-CLIP, Talk2DINO, Trident, and TextRegion). Both RADSeg and RADSeg+ produce noticeably clearer and more accurate segmentation maps across all cases.
  • Figure 4: Qualitative 3D Open-Vocabulary Semantic Segmentation Results. We show two scenes: one from Replica ("chair", "table", "couch" classes), and one from ScanNet++ ("bed", "pillow", "monitor" classes). Segmented voxels are overlaid on the RGB for visualization. Across all 3D baselines, RADSeg provides more accurate segmentations with far fewer outlier voxels.
  • Figure A.1: Ablation study on different temperature factors $\tau_{scra}$ and $\tau_{scga}$. The left plot shows performance as $\tau_{scra}$ varies without SCGA. The right plot fixes $\tau_{scra}=10$ and varies $\tau_{scga}$. Overall, $\tau_{scra}=\tau_{scga}=10$ yields the best results on average.
  • ...and 3 more figures
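
The Figure 1 caption describes two ideas that a small example may make more concrete: building a patch-to-patch similarity matrix from backbone features and using it to aggregate them, then labeling each patch by comparing it with text embeddings. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. All function names are hypothetical. Cosine similarity, a softmax in which the temperature multiplies the similarity, and a single aggregation pass are assumptions made for illustration. The recursive feedback into RADIO's last attention block, sliding windows, the SigLIP adaptor, and RADIO-SAM refinement are all omitted.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def self_correlation_weights(feats, tau=10.0):
    """Patch-to-patch weights from cosine similarity of the features themselves.

    Assumption: the temperature `tau` multiplies the similarity before the
    softmax; the paper's exact formulation may differ.
    """
    f = l2_normalize(feats)
    sim = f @ f.T                      # (N, N) cosine similarities
    return softmax(tau * sim, axis=-1) # each row sums to 1


def aggregate(feats, tau=10.0):
    """Replace each patch feature by a similarity-weighted mix of all patches."""
    return self_correlation_weights(feats, tau) @ feats


def classify_patches(feats, text_embeds):
    """Assign each patch the index of the most similar text embedding."""
    logits = l2_normalize(feats) @ l2_normalize(text_embeds).T
    return logits.argmax(axis=-1)


if __name__ == "__main__":
    # Four toy "patches" in a 2-D feature space and two toy "class prompts".
    patches = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]])
    prompts = np.array([[1.0, 0.0], [0.0, 1.0]])
    print(classify_patches(aggregate(patches), prompts))
```

In this toy setup, a larger `tau` concentrates each row of the weight matrix on the most similar patches, while a smaller `tau` mixes features more uniformly across the image.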