In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

Dahyun Kang, Minsu Cho

TL;DR

This work introduces Lazy Visual Grounding (LaVG), a training-free, two-stage framework for open-vocabulary semantic segmentation (OVSeg). It first performs unsupervised object mask discovery via panoptic cut, a Normalized Cut-based partitioning on self-supervised DINO features, and then grounds each discovered object to free-form text descriptions using cross-modal similarity with CLIP/SCLIP. By decoupling object discovery from text grounding and emphasizing late interaction, LaVG achieves state-of-the-art results on multiple OVSeg benchmarks while offering precise object boundaries and fewer spurious correlations. The approach challenges pixel-to-text grounding as the sole pathway for OVSeg and demonstrates that classic vision techniques, when combined with modern multi-modal embeddings, can deliver strong, training-free segmentation in open-set contexts.

Abstract

We present lazy visual grounding, a two-stage approach to open-vocabulary semantic segmentation: unsupervised object mask discovery followed by object grounding. Much prior work casts this task as pixel-to-text classification without object-level comprehension, leveraging the image-to-text classification capability of pretrained vision-and-language models. We argue that visual objects are distinguishable without prior text information, as segmentation is essentially a vision task. Lazy visual grounding first discovers object masks covering an image with iterative Normalized cuts and then assigns text to the discovered objects in a late-interaction manner. Our model requires no additional training yet performs strongly on five public datasets: Pascal VOC, Pascal Context, COCO-object, COCO-stuff, and ADE 20K. In particular, the visually appealing segmentation results demonstrate the model's capability to localize objects precisely. Paper homepage: https://cvlab.postech.ac.kr/research/lazygrounding
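
To make the mask-discovery idea concrete, here is a minimal sketch of a single Normalized cut bipartition over patch features. It is not the authors' implementation: the abstract describes an iterative procedure, while this shows only one split. The function name `normalized_cut_bipartition`, the parameters `tau` and `eps`, and the binarized cosine affinity are illustrative assumptions.

```python
import numpy as np

def normalized_cut_bipartition(features, tau=0.2, eps=1e-5):
    """Split N feature vectors into two groups with one Normalized cut.

    features: (N, C) array, e.g. patch features from a self-supervised ViT.
    tau, eps: hypothetical affinity-thresholding parameters (assumptions).
    Returns a boolean array of shape (N,) marking one side of the cut.
    """
    # Cosine similarity between every pair of feature vectors.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = f @ f.T
    # Binarize the affinity: strong edges get weight 1, weak edges a small
    # positive weight so the graph stays connected (an assumed design choice).
    W = np.where(W > tau, 1.0, eps)
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second-smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(L_sym)
    # Map back to the generalized problem (D - W) y = lambda * D y.
    y = d_inv_sqrt * eigvecs[:, 1]
    # Threshold at the mean to obtain the two groups.
    return y > y.mean()
```

Applying such a split repeatedly to the resulting regions is one way to obtain several masks per image; the stopping rule and any mask refinement are left out of this sketch.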

Paper Structure (34 sections, 6 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Comparison of a pixel-grounding method and our object-grounding method. A pixel-grounding method (SCLIP) often produces spurious correlations with imprecise edges. Unlike this pixel-to-text classification approach for open-vocabulary segmentation, we solve this task with lazy visual grounding: a two-stage approach of object mask discovery followed by object grounding in a late interaction manner.
  • Figure 2: Segmentation results of LaVG given the text description set. These precise object masks and visual grounding results are produced in a training-free fashion.
  • Figure 3: Two stages of Lazy Visual Grounding (LaVG). Given an image, LaVG first discovers existing object masks without the text information (panoptic cut) and then later assigns the class in text descriptions to each object with cross-modal similarity (object grounding).
  • Figure 4: The result of Normalized cut before and after refinement.
  • Figure 5: Qualitative comparison of our method and baselines on COCO-stuff 164K. The * mark on the text set denotes the false positive class prediction of our model.
  • ...and 2 more figures
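
The object-grounding stage named in the Figure 3 caption (assigning a text class to each discovered mask by cross-modal similarity) can be sketched as follows. This is a hedged illustration, not the paper's code: the function name `ground_masks`, the input shapes, and the choice to average per-patch similarities inside each mask are assumptions, and the embeddings are taken to come from some vision-language model with a shared image-text space.

```python
import numpy as np

def ground_masks(masks, patch_emb, text_emb):
    """Assign one text class to each object mask (late interaction).

    masks:     (K, N) boolean array, K object masks over N patches.
    patch_emb: (N, C) per-patch image embeddings (assumed input).
    text_emb:  (T, C) embeddings of the T text descriptions (assumed input).
    Returns an integer array of shape (K,) with the class index per mask.
    """
    # L2-normalize so dot products are cosine similarities.
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = p @ t.T  # (N, T) patch-to-text similarity
    m = masks.astype(float)
    # Average the similarity over the patches inside each mask;
    # the maximum guards against division by zero for empty masks.
    mask_sim = (m @ sim) / np.maximum(m.sum(axis=1, keepdims=True), 1.0)
    # Each mask takes the class with the highest average similarity.
    return mask_sim.argmax(axis=1)
```

Because the class decision is made once per mask rather than once per pixel, every patch inside a mask receives the same label, which is the sense in which text interaction happens late in this sketch.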