Center-guided Classifier for Semantic Segmentation of Remote Sensing Images

Wei Zhang, Mengting Ma, Yizhen Jiang, Rongrong Lian, Zhenkai Wu, Kangning Cui, Xiaowen Ma

TL;DR

CenterSeg addresses the large intraclass variance of remote sensing images (RSIs) in semantic segmentation by replacing the standard parametric softmax classifier with a center-guided classifier that uses multiple prototypes per class, derived from local class centers. Prototypes are generated by ground-truth-guided feature aggregation, followed by hard attention assignment and momentum updates. They are regularized through two Grassmann-manifold terms, prototype-to-prototype orthogonality and feature-to-prototype alignment, which improve intraclass compactness and interclass separability. The approach is plug-and-play, lightweight, and interpretable, and it achieves state-of-the-art or competitive results on the Vaihingen, Potsdam, and LoveDA datasets while remaining compatible with existing RSI segmentation backbones. Its prototype-based decision mechanism is transparent and needs minimal extra storage, which supports robust performance on RSI tasks with high intraclass variance.
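To make the prototype-to-prototype orthogonality idea more concrete, the sketch below computes the standard projection-metric distance between two class subspaces on the Grassmann manifold. It is an illustration under assumptions, not the paper's loss: the helper names `orthonormal_basis` and `projection_distance` are hypothetical, each class subspace is assumed to be spanned by that class's (linearly independent) prototypes, and the exact form of the regularizer used by CenterSeg is not given in this summary.

```python
import numpy as np

def orthonormal_basis(prototypes):
    """Orthonormal basis (as columns) of the subspace spanned by the prototype rows.

    prototypes: array of shape (M, D) with M <= D, rows assumed linearly independent.
    """
    # Reduced QR of the transpose: the columns of q span the same subspace as the rows.
    q, _ = np.linalg.qr(prototypes.T)
    return q  # shape (D, M)

def projection_distance(basis_a, basis_b):
    """Projection-metric distance between two subspaces given orthonormal bases."""
    proj_a = basis_a @ basis_a.T  # projection matrix onto subspace A
    proj_b = basis_b @ basis_b.T  # projection matrix onto subspace B
    return np.linalg.norm(proj_a - proj_b, ord="fro") / np.sqrt(2.0)

# Toy example in R^4: two prototypes per class (hypothetical values).
class_a = np.array([[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]])  # spans e1, e2
class_b = np.array([[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 2.0]])  # spans e3, e4

basis_a = orthonormal_basis(class_a)
basis_b = orthonormal_basis(class_b)
dist_ab = projection_distance(basis_a, basis_b)  # mutually orthogonal subspaces
dist_aa = projection_distance(basis_a, basis_a)  # identical subspaces
```

For two subspaces of equal dimension, this distance is zero when they coincide and largest when they are mutually orthogonal, which is the geometric situation the orthogonality term is described as encouraging.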

Abstract

Compared with natural images, remote sensing images (RSIs) have a distinctive characteristic, namely larger intraclass variance, which makes their semantic segmentation more challenging. Moreover, existing semantic segmentation models for RSIs usually employ a vanilla softmax classifier, which has three drawbacks: (1) indirect supervision of the pixel representations during training; (2) inadequate modeling ability of parametric softmax classifiers under large intraclass variance; and (3) an opaque classification decision process. In this paper, we propose a novel classifier, CenterSeg, customized for RSI semantic segmentation, which addresses these problems with multiple prototypes, direct supervision on the Grassmann manifold, and an interpretability strategy. Specifically, for each class, CenterSeg obtains local class centers by aggregating the corresponding pixel features based on ground-truth masks, and generates multiple prototypes through hard attention assignment and momentum updating. In addition, we introduce the Grassmann manifold and constrain the joint embedding space of pixel features and prototypes with two additional regularization terms. Notably, during inference, CenterSeg can further provide interpretability by restricting each prototype to be a sample from the training set. Experimental results on three remote sensing segmentation datasets validate the effectiveness of the model. Besides superior performance, CenterSeg offers simplicity, a lightweight design, compatibility, and interpretability. Code is available at https://github.com/xwmaxwma/rssegmentation.
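The prototype generation step described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `class_centers` and `update_prototypes`, the use of cosine similarity for the hard assignment, and the momentum value are all hypothetical choices made for the example.

```python
import numpy as np

def class_centers(features, labels, num_classes):
    """Average the pixel features of each class using the ground-truth labels.

    features: (N, D) pixel features; labels: (N,) integer class ids.
    Returns centers of shape (K, D) and a boolean (K,) flag for classes present.
    """
    centers = np.zeros((num_classes, features.shape[1]))
    present = np.zeros(num_classes, dtype=bool)
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            centers[k] = features[mask].mean(axis=0)
            present[k] = True
    return centers, present

def update_prototypes(prototypes, centers, present, momentum=0.9):
    """Hard assignment plus momentum update (assumed form).

    prototypes: (K, M, D). For each class present in the batch, only the
    prototype most similar to the class center (cosine similarity) is updated.
    """
    updated = prototypes.copy()
    eps = 1e-12
    for k in np.flatnonzero(present):
        norms = np.linalg.norm(prototypes[k], axis=1) * np.linalg.norm(centers[k])
        sims = (prototypes[k] @ centers[k]) / (norms + eps)
        m = int(np.argmax(sims))  # hard attention: pick a single prototype
        updated[k, m] = momentum * prototypes[k, m] + (1.0 - momentum) * centers[k]
    return updated

# Toy example: 3 pixels, 2-D features, 3 classes (class 2 is absent from the batch).
features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])
prototypes = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, -1.0]],
])
centers, present = class_centers(features, labels, num_classes=3)
new_prototypes = update_prototypes(prototypes, centers, present, momentum=0.5)
```

In this toy run, the center of class 0 is the mean of its two pixels, and only the class-0 prototype pointing in the same direction moves toward it; prototypes of the absent class stay unchanged.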

Paper Structure

This paper contains 16 sections, 19 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Illustration of a key characteristic of remote sensing images. Images are selected from the LoveDA dataset. For example, the color and shape of the building class differ significantly between Scene $1$ and Scene $2$, as well as between different regions of the same scene, underscoring the large intraclass variance of remote sensing images.
  • Figure 2: Architecture of the proposed CenterSeg, which consists of two key components: prototype generation and regularizer terms. At the training stage, the input image is first projected into a high-dimensional space by the encoder and decoder (ED) to obtain the pixel features $\mathcal{F}$. Next, we aggregate the features based on the ground-truth mask to obtain the class centers $\mathcal{S}$, i.e., the representative features of each class. Then, we generate the prototypes $\mathcal{P}$ through hard attention assignment and momentum update. In addition, two regularizer terms, $\mathcal{L}_{pp}$ and $\mathcal{L}_{fp}$, are proposed to optimize the prototype generation. At the inference stage, classification decisions are made directly from the similarity of the pixel features to the prototypes.
  • Figure 3: Illustration of the Grassmann manifold semantic space. Left: the space is constructed from category-aware basis concepts, where the subspace of each class can be regarded as a point on the Grassmann manifold. Right: the bases of the subspaces representing different classes are orthogonal to each other. By minimizing the distance between the projection matrices of two points on the Grassmann manifold, subspaces of different classes can be made far away from each other, thereby achieving better class separation.
  • Figure 4: The distance computation between pixels in $\mathcal{R}$ and the prototypes of their ground-truth classes. Specifically, we apply a mask to the distance map (i.e., $\|\mathcal{F}-\mathcal{P}\|_2$) to obtain the Euclidean distance from each pixel to the corresponding prototypes of class $k$, and we constrain the minimum of these distances.
  • Figure 5: The distance computation between pixels in $\mathcal{R}$ and the prototypes of non-ground-truth classes. Specifically, we constrain the minimum distance from a pixel to a prototype of any other class and encourage it to be as large as possible.
  • ...and 6 more figures
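
The distance computations in the captions of Figures 4 and 5, together with the prototype-based inference in the caption of Figure 2, can be illustrated with a small NumPy sketch. It rests on assumptions: the function names are hypothetical, Euclidean distance is used for inference as well (the caption only says "similarity"), and how the two minimum distances enter the actual loss is not specified in this summary.

```python
import numpy as np

def pixel_prototype_distances(features, prototypes):
    """Euclidean distance from every pixel to every prototype.

    features: (N, D); prototypes: (K, M, D). Returns an array of shape (N, K, M).
    """
    diff = features[:, None, None, :] - prototypes[None, :, :, :]
    return np.linalg.norm(diff, axis=-1)

def masked_min_distances(features, prototypes, labels):
    """Minimum distance to ground-truth prototypes and to other-class prototypes.

    Returns (d_pos, d_neg), each of shape (N,). Requires at least two classes.
    """
    nearest = pixel_prototype_distances(features, prototypes).min(axis=2)  # (N, K)
    rows = np.arange(features.shape[0])
    d_pos = nearest[rows, labels]  # nearest prototype of the pixel's own class
    gt_mask = np.zeros(nearest.shape, dtype=bool)
    gt_mask[rows, labels] = True
    # Mask out the ground-truth class, then take the nearest remaining prototype.
    d_neg = np.where(gt_mask, np.inf, nearest).min(axis=1)
    return d_pos, d_neg

def predict(features, prototypes):
    """Assign each pixel to the class that owns its nearest prototype."""
    return pixel_prototype_distances(features, prototypes).min(axis=2).argmin(axis=1)

# Toy example: 2 classes with 2 prototypes each, 2 pixels (hypothetical values).
prototypes = np.array([
    [[0.0, 0.0], [1.0, 0.0]],
    [[5.0, 5.0], [6.0, 5.0]],
])
features = np.array([[1.0, 1.0], [5.0, 4.0]])
labels = np.array([0, 1])

d_pos, d_neg = masked_min_distances(features, prototypes, labels)
pred = predict(features, prototypes)
```

A training objective in this spirit would push `d_pos` down and `d_neg` up; at inference only `predict` is needed, which is why the classifier adds little beyond storing the prototypes.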