LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation

Junyang Chen; Xiangbo Lv; Zhiqiang Kou; Xingdong Sheng; Ning Xu; Yiguo Qiao

TL;DR

LoGoSeg tackles open-vocabulary semantic segmentation by addressing the spatial-grounding gaps that come from the image-level supervision of vision-language models. It introduces three components (an adaptive object existence prior, region-aware visual-textual alignment, and a dual-stream fusion architecture) that unify local structural details with global semantic context in a single-stage framework. The method reports strong performance and generalization across six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, PAS-20b) without external mask proposals or additional datasets, and exhibits near-linear gains when scaling backbone capacity. The aim is more efficient and accurate cross-modal grounding in cluttered or ambiguous scenes, and therefore more reliable open-vocabulary segmentation in practical applications.

Abstract

Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
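The object existence prior in (i) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the softmax over cosine similarities, the `temperature` value, and the multiply-then-renormalize step are all assumptions chosen to show the general idea of down-weighting categories that the global image-text similarity suggests are absent.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def existence_prior(image_emb, text_embs, temperature=0.1):
    """Hypothetical prior: softmax over global image-text cosine similarities."""
    sims = [cosine(image_emb, t) / temperature for t in text_embs]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def apply_prior(pixel_probs, prior):
    """Scale each pixel's class probabilities by the prior, then renormalize."""
    out = []
    for row in pixel_probs:
        scaled = [p * w for p, w in zip(row, prior)]
        total = sum(scaled)
        out.append([s / total for s in scaled])
    return out

# Toy embeddings (assumed values) for three category names.
image_emb = [1.0, 0.0]
text_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # e.g. "cat", "tv", "pizza"
prior = existence_prior(image_emb, text_embs)

# One pixel whose raw prediction favors the second category.
pixel_probs = [[0.4, 0.5, 0.1]]
reweighted = apply_prior(pixel_probs, prior)
```

In this toy case the second category has almost no global support, so the reweighted pixel switches to the first category. This is the kind of hallucination suppression the abstract describes.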

Paper Structure (29 sections, 13 equations, 3 figures, 5 tables)

Introduction
Method
  Preliminary
    Notation.
  Prior-Guided Regional Alignment
    Object Prior Estimation.
    Region-Level Textual Guidance.
    Region-Level Visual Guidance.
    Region-aware Guidance Integration.
  Contextual Cross-modal Fusion
  Decoder and Loss Function
    Decoder.
    Loss Function.
Experiments
  Datasets and Evaluation Metric
...and 14 more sections

Figures (3)

Figure 1: Comparison of open-vocabulary segmentation frameworks. (Top) Two-stage methods rely on external mask proposals, often causing hallucinations. (Middle) One-stage methods are more efficient but struggle with pixel-level grounding. (Bottom) LoGoSeg integrates object priors, region alignment, and dual fusion to improve cross-modal consistency and segmentation quality.
Figure 2: Overview of LoGoSeg. CLIP encoders extract multi-level visual and textual embeddings. A prior-guided alignment module uses a lightweight MLP to score regions for category-aware guidance and a Region Aligner to tightly align visual and textual features. A dual-branch fusion integrates local directional self-attention with global state-space modeling. A transformer with learnable queries then performs fine-grained cross-modal fusion via linear attention, and a hierarchical decoder with learnable upsampling and guidance-modulated convolutions yields the final segmentation map.
Figure 3: Qualitative comparison. Rows show ground truth, SED predictions, and LoGoSeg results. Region-aware alignment and dual-branch fusion yield more stable, less hallucinatory segmentations. LoGoSeg correctly identifies pizza and cat missing from the ground truth and avoids SED’s mislabels (e.g., vegetables as banana, paintings as TV).
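
The Figure 2 caption mentions cross-modal fusion via linear attention. The caption does not say which formulation is used, so the sketch below assumes the common kernelized variant with the feature map phi(x) = elu(x) + 1. The function names and toy inputs are illustrative only. The point is that the key-value summary is computed once and reused for every query, which makes the cost linear in the number of keys.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1 (an assumed choice of kernel)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    """Kernelized attention: out_i = phi(q_i)^T S / (phi(q_i)^T z),
    where S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j)."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # S, shape d_k x d_v
    k_sum = [0.0] * d_k                     # z, shape d_k
    for k, v in zip(keys, values):
        fk = phi(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    outputs = []
    for q in queries:
        fq = phi(q)
        denom = sum(fq[a] * k_sum[a] for a in range(d_k))
        outputs.append(
            [sum(fq[a] * kv[a][b] for a in range(d_k)) / denom for b in range(d_v)]
        )
    return outputs

# Toy example: one query attending over two key/value pairs.
queries = [[0.5, -1.0]]
keys = [[1.0, 0.0], [0.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
fused = linear_attention(queries, keys, values)
```

Because the values here are one-hot, each output row is simply the normalized attention weights, which are positive and sum to one.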
