Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation

ByeongCheol Lee, Hyun Seok Seong, Sangeek Hyun, Gilhan Park, WonJun Moon, Jae-Pil Heo

Abstract

A sliding-window inference strategy is commonly adopted in recent training-free open-vocabulary semantic segmentation methods to overcome CLIP's limitations in processing high-resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancies across windows. To address this issue, we propose Global-Local Aligned CLIP (GLA-CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA-CLIP extends key-value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer-window tokens are less likely to be attended to, since query features are produced through interactions among inner-window patches and therefore lack semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens from all windows that are highly similar to the given query, which provides a unified semantic reference for measuring similarity across both inner- and outer-window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by dynamically scaling and thresholding the attention map to cope with small-object scenarios. Moreover, GLA-CLIP can be integrated into existing methods to broaden their receptive field. Extensive experiments validate the effectiveness of GLA-CLIP in enhancing training-free open-vocabulary semantic segmentation performance. Code is available at https://github.com/2btlFe/GLA-CLIP.
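
The key-value token extension and proxy anchor described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the cosine-similarity attention, the `temperature` and `sim_threshold` values, and the mean-pooling rule for building the anchor are all assumptions made for illustration, and the dynamic normalization step is omitted because its exact form is not given here.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extended_kv_attention(vfm_feats, clip_values, window_idx,
                          temperature=0.1, sim_threshold=0.8):
    """Attend from one window's tokens to the tokens of ALL windows.

    vfm_feats:   (W, N, Dk) VFM features per window (queries and keys).
    clip_values: (W, N, Dv) CLIP value tokens per window.
    Returns (output, attention) with shapes (N, Dv) and (N, W*N).
    """
    num_windows, num_tokens, dim_k = vfm_feats.shape
    # Queries come from the current window only.
    q = l2_normalize(vfm_feats[window_idx])                      # (N, Dk)
    # Keys and values are gathered from every window.
    k = l2_normalize(vfm_feats.reshape(num_windows * num_tokens, dim_k))
    v = clip_values.reshape(num_windows * num_tokens, -1)

    # Hypothetical proxy anchor: average the keys whose cosine similarity
    # to the query exceeds sim_threshold, then use it in place of the query.
    positives = (q @ k.T >= sim_threshold).astype(q.dtype)       # (N, W*N)
    counts = np.maximum(positives.sum(axis=-1, keepdims=True), 1.0)
    anchors = l2_normalize(positives @ k / counts)               # (N, Dk)

    attn = softmax(anchors @ k.T / temperature, axis=-1)         # (N, W*N)
    return attn @ v, attn

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4, 8))    # 3 windows, 4 tokens, 8 dims
values = rng.standard_normal((3, 4, 5))
out, attn = extended_kv_attention(feats, values, window_idx=0)
```

Each attention row spans the tokens of all three windows rather than only the four tokens of window 0, which is the point of extending the key-value set.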

Paper Structure (41 sections, 13 equations, 13 figures, 12 tables)

Figures (13)

  • Figure 1: Comparison of segmentation consistency near window boundaries. (a) We evaluate segmentation inconsistency using the Boundary Error Rate (BER). BER is defined as the proportion of pixels near adjacent window boundaries where predicted labels differ despite identical ground truth. ProxyCLIP yields a high BER due to its lack of cross-window interaction, whereas our method significantly reduces BER by incorporating global context into the attention process. More details are provided in the appendix on BER. (b) ProxyCLIP exhibits grid artifacts (marked with white circles), caused by the limited receptive field within individual windows. In contrast, our method mitigates these artifacts by leveraging contextual information beyond local windows.
  • Figure 2: Overview of our proposed framework. The input image is first divided into overlapping windows and processed using frozen backbones: a Vision Foundation Model (VFM, e.g., DINO) and CLIP. We introduce a Key-Value Token Extension, where VFM features from the current window $\mathbf{F}_{\text{vfm}}$ serve as query tokens, while key tokens are gathered from all windows to provide global context. The corresponding value tokens $V$ are extracted from the final transformer layer of CLIP. Cross-attention is then applied, followed by a projection layer to generate the final visual features $\mathbf{F}_{\text{visual}}$. To stabilize attention across windows, each query token is replaced with a semantically representative proxy anchor. Finally, a dynamic normalization scheme adjusts attention strength based on object size, approximated by the number of positive samples associated with each proxy anchor.
  • Figure 3: Visualization of attention maps for an anchor query token. Proxy-based attention enhances focus on semantically relevant regions across both inner- and outer-window areas. The subsequent dynamic normalization further suppresses irrelevant responses, especially from noisy tokens, yielding sharper and more semantically consistent attention distributions.
  • Figure 4: Qualitative results of ProxyCLIP [lan2024proxyclip], CASS [CASS], and ours on Pascal VOC21 [pascalvoc], COCO-Stuff [coco], and Cityscapes [cordts2016cityscapes].
  • Figure 5: Class-wise object scale and 1/w in Cityscapes
  • ...and 8 more figures
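
The Boundary Error Rate from the Figure 1 caption can be sketched in a few lines of NumPy. The caption gives only the verbal definition, so the details below are assumptions: "near" is taken to mean the pair of pixels directly on either side of a boundary, and the function name and the `col_boundaries` / `row_boundaries` arguments are hypothetical. The paper's appendix may define the neighborhood differently.

```python
import numpy as np

def boundary_error_rate(pred, gt, col_boundaries=(), row_boundaries=()):
    """Fraction of boundary-straddling pixel pairs whose ground-truth
    labels match but whose predicted labels differ.

    pred, gt:       (H, W) integer label maps.
    col_boundaries: x positions where one window ends and the next begins.
    row_boundaries: y positions where one window ends and the next begins.
    """
    same_gt = 0      # pairs with identical ground truth
    mismatched = 0   # of those, pairs with different predictions
    for x in col_boundaries:
        gt_equal = gt[:, x - 1] == gt[:, x]
        pred_diff = pred[:, x - 1] != pred[:, x]
        same_gt += int(gt_equal.sum())
        mismatched += int((gt_equal & pred_diff).sum())
    for y in row_boundaries:
        gt_equal = gt[y - 1, :] == gt[y, :]
        pred_diff = pred[y - 1, :] != pred[y, :]
        same_gt += int(gt_equal.sum())
        mismatched += int((gt_equal & pred_diff).sum())
    return mismatched / same_gt if same_gt else 0.0

# Toy 4x4 example: one object fills the image, with a window boundary at x = 2.
gt = np.zeros((4, 4), dtype=int)
pred = np.zeros((4, 4), dtype=int)
pred[:2, 2:] = 1   # the right-hand window mislabels its top two rows
ber = boundary_error_rate(pred, gt, col_boundaries=[2])
```

In the toy example, two of the four pixel pairs across the boundary disagree in prediction while sharing a ground-truth label, so `ber` is 0.5.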