Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation

Jie Feng, Fengze Li, Junpeng Zhang, Siyu Chen, Yuping Liang, Junying Chen, Ronghua Shang

Abstract

Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its globally aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, failing to effectively delineate boundaries while risking the disruption of CLIP's semantic integrity. To address this limitation, in this paper we propose DR-Seg, a novel decouple-and-rectify framework. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state-of-the-art.

Paper Structure

This paper contains 14 sections, 13 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison of DINO fusion paradigms in RS open-vocabulary semantic segmentation. (a) Existing methods apply non-selective structural enhancement, resulting in diffused responses and coarse boundaries. (b) DR-Seg decouples the representation and selectively injects structural priors into the structure-dominated subspace, producing sharper boundaries while preserving semantic integrity.
  • Figure 2: Channel-wise impact analysis of CLIP representations on Potsdam. Top & Middle: The mIoU drop is highly uneven when individual channels are independently masked. By ranking the channels based on these drops, we categorize them into "positive" and "negative". Bottom: Activation maps further reveal distinct channel roles: dropping the top 20% negative channels sharpens spatial responses, whereas dropping the top 20% positive ones severely weakens discriminative activations.
  • Figure 3: Overall framework of DR-Seg. The proposed pipeline consists of three stages: Decouple, which separates CLIP features into semantics-dominated and structure-dominated subspaces via SPSD; Rectify, which selectively rectifies the structure-dominated subspace under DINO structural guidance through PDGR; and Fusion, which adaptively integrates the original multi-rotation CLIP branch and the refined branch via UGAF for the final segmentation prediction.
  • Figure 4: Detailed illustration of PDGR. It constructs a sparse affinity map using DINO-based feature similarity and spatial proximity. The resulting sparse graph, visualized by white lines (top-$k$ neighbors), dynamically rectifies the structure-dominated subspace via prior-driven spatial routing.
  • Figure 5: Qualitative comparison with GSNet and RSKT-Seg. DR-Seg yields more accurate boundaries, cleaner category assignments, and better recovery of small or thin structures. Results are produced by a ViT-L model trained on DLRSD.
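The channel-wise analysis described for Figure 2 (masking each CLIP channel independently, measuring the mIoU drop, and ranking channels as "positive" or "negative") can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `(C, H, W)` feature layout and the `evaluate_miou` scoring callback are assumptions introduced here.

```python
import numpy as np

def channel_impact_ranking(features, evaluate_miou):
    """Rank feature channels by the mIoU drop caused by masking each one.

    `features` is assumed to be a (C, H, W) feature map; `evaluate_miou`
    is a hypothetical callback that scores a feature map. Channels whose
    removal hurts the score most rank first ("positive"); channels with a
    negative drop (removal helps) are "negative".
    """
    base = evaluate_miou(features)
    drops = []
    for c in range(features.shape[0]):
        masked = features.copy()
        masked[c] = 0.0  # zero out one channel at a time
        drops.append(base - evaluate_miou(masked))
    drops = np.asarray(drops)
    order = np.argsort(drops)[::-1]  # largest drop (most "positive") first
    return order, drops
```

Under this sketch, dropping the top-ranked channels would weaken discriminative activations, while dropping the bottom-ranked ones would sharpen spatial responses, mirroring the behavior reported for Figure 2.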
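The sparse affinity construction described for Figure 4 (feature similarity weighted by spatial proximity, keeping only the top-$k$ neighbors per patch) can be sketched as below. The `(H, W, D)` patch-feature layout and the Gaussian proximity kernel are assumptions for illustration; the paper's PDGR module is not reproduced here.

```python
import numpy as np

def sparse_affinity(dino_feats, k=8, sigma=2.0):
    """Build a top-k sparse affinity map from cosine feature similarity
    modulated by spatial proximity.

    `dino_feats` is assumed to be an (H, W, D) grid of patch features;
    `sigma` controls the (assumed) Gaussian spatial kernel.
    """
    H, W, D = dino_feats.shape
    flat = dino_feats.reshape(-1, D)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    sim = flat @ flat.T  # cosine similarity between patches

    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    prox = np.exp(-d2 / (2 * sigma ** 2))  # nearby patches weigh more

    aff = sim * prox
    np.fill_diagonal(aff, -np.inf)  # exclude self-loops
    # keep only the top-k neighbors per patch (the white lines in Fig. 4)
    sparse = np.zeros_like(aff)
    idx = np.argpartition(aff, -k, axis=1)[:, -k:]
    rows = np.arange(aff.shape[0])[:, None]
    sparse[rows, idx] = aff[rows, idx]
    return sparse
```

The resulting sparse graph could then route structural information between the retained neighbor pairs, which is the role the caption attributes to prior-driven spatial routing.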