MultiScale Probability Map guided Index Pooling with Attention-based learning for Road and Building Segmentation

Shirsha Bose, Ritesh Sur Chowdhury, Debabrata Pal, Shivashish Bose, Biplab Banerjee, Subhasis Chaudhuri

TL;DR

This paper tackles the joint task of road network and building footprint segmentation from high-resolution satellite imagery, a problem challenged by semantic information loss during pooling. It introduces MSSDMPA-Net, a four-path dilated-encoder architecture that employs Dynamic Attention Map Guided Index Pooling (DAMIP) and Dynamic Attention Map Guided Spatial and Channel Attention (DAMSCA), guided by multi-scale supervised probability maps (DPMG) to preserve geometry and context during downsampling and upsampling. The approach yields state-of-the-art results across seven benchmarks, with extensive ablations validating the efficacy of DAMIP, DAMSCA, and deep supervision in improving both road connectivity and building boundary accuracy. The method promises robust, high-fidelity map extraction for applications in urban planning, disaster response, and autonomous systems, and points to future work in low-shot segmentation.

Abstract

Efficient road and building footprint extraction from satellite images is central to many remote sensing applications. However, precise segmentation map extraction is challenging due to diverse building structures camouflaged by trees, similar spectral responses between roads and buildings, and occlusions by heterogeneous traffic over the roads. Existing convolutional neural network (CNN)-based methods focus on either enriched spatial semantics learning for building extraction or fine-grained road topology extraction. The profound semantic information loss caused by traditional pooling mechanisms in CNNs generates fragmented and disconnected road maps and poorly segmented boundaries for densely spaced small buildings in complex surroundings. In this paper, we propose a novel attention-aware segmentation framework, Multi-Scale Supervised Dilated Multiple-Path Attention Network (MSSDMPA-Net), equipped with two new modules, Dynamic Attention Map Guided Index Pooling (DAMIP) and Dynamic Attention Map Guided Spatial and Channel Attention (DAMSCA), to precisely extract building footprints and road maps from remotely sensed images. DAMIP mines the salient features by employing a novel index pooling mechanism to retain important geometric information. DAMSCA, in turn, simultaneously extracts multi-scale spatial and spectral features. In addition, using dilated convolution and multi-scale deep supervision in optimizing MSSDMPA-Net helps achieve strong performance. Experimental results over multiple benchmark building and road extraction datasets establish MSSDMPA-Net as the state-of-the-art (SOTA) method for building and road extraction.

Paper Structure (28 sections, 11 equations, 8 figures, 4 tables)

This paper contains 28 sections, 11 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Illustration of the different pooling mechanisms: average pooling, max pooling, and index pooling. In index pooling, convolving a feature map with different binary kernels produces multiple downsampled feature maps. Each kernel contains a 'one' at a unique index position, and the remaining entries are 'zero'. The downsampled features are then concatenated channel-wise. The index pooling layer preserves all the feature information and eliminates the semantic information loss while performing the spatial dimension reduction. For ease of understanding, we show four kernels of size (2, 2) with a stride of 2, each of which downsamples the feature map by selecting one uniquely indexed value in every 2 x 2 window.
  • Figure 2: The architecture of the proposed MSSDMPA-Net contains four major modules, namely, the multi-path encoder (${\mathcal{H}_i}$), DAMIP (${\mathcal{P}_i}$), DAMSCA (${\mathcal{A}_i}$), and decoder (${\mathcal{D}}$). Each of the four paths is denoted by $i$, where $i \in \{1,2,3,4\}$, and $j$ indexes the dilated convolution blocks, where $j \in \{1,2,3,4\}$. Each path of the multi-path encoder comprises a feature encoder ($\mathcal{G}_i$), which has several dilated convolution blocks connected in series with an incremental dilation rate, and a DPMG block ($\mathcal{F}_i$). Each of the three DAMIP modules ($\mathcal{P}_i$) at the individual paths performs attention-based learning and downsamples the spatial resolution of the feature map. The DAMSCA modules ($\mathcal{A}_i$) also perform attention-based learning but upsample the feature maps. Finally, the features generated by the DAMSCA modules are concatenated ($f_{dec}$) and passed to the decoder ($\mathcal{D}$) to generate a segmented probability map. Using multi-scale supervision, we jointly optimize the probability maps generated by the multi-path encoders ($m_{i}$) and the decoder-produced probability map ($m_{out}$).
  • Figure 3: The architecture of the single-level encoder and its constituents, namely, the dilated convolution block and DPMG ($\mathcal{F}_i$). Inside the single-level encoder in (a), several dilated convolution blocks, shown in (b), are connected serially with an increasing dilation rate $i.j.r$ to construct the feature encoder, followed by a DPMG module in (c). Here $j$ denotes the index of the dilated convolution block in the $i^{th}$ path, and $r$ is a constant. The DPMG block processes the encoded features from the feature encoder to generate the segmented probability map, which helps in attention-based learning.
  • Figure 4: Illustration of the novel Index Pooling mechanism and its usage in the DAMIP module $(\mathcal{P}_i)$ for the $i^{th}$ path of MSSDMPA-Net. (a) The Index Pooling layer $(\mathcal{IP})$ downsamples the previous layer's feature map spatially by distributing its values across the channel dimension with binary kernels. Each kernel contains a single 'one' at a unique position to extract the previous layer's feature map value at a specific indexed location. For a $(2 \times 2)$ kernel with stride = 2, the Index Pooling layer reduces each spatial dimension by a factor of 2 and increases the channel dimension by a factor of 4. (b) Using these spatially downsampled features, the DAMIP module performs attention-based learning to amplify salient information broadcasting.
  • Figure 5: Illustration of the novel spatio-spectral attention block, DAMSCA ($\mathcal{A}_i$). Using the DPMG-generated segmented probability maps $(m_i)$, DAMSCA performs attention learning in the spatial and channel dimensions. For spatial attention, $(m_i)$ is applied directly to a multi-scale feature map $(f_i)$. For channel attention, the probability map is convolved and average-pooled before channel-wise multiplication with $(f_i)$. Finally, the spatial and channel attention outputs are added and upsampled to produce ($\tilde{f_i}$).
  • ...and 3 more figures
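
The index pooling operation described in the captions of Figures 1 and 4 can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `index_pool` and `max_pool`, the nested-list data layout, and the ordering of the output channels are assumptions made for illustration only. Slicing with a stride is used here in place of an explicit convolution with one-hot binary kernels, since the two select the same values.

```python
def index_pool(feature, k=2):
    """Index pooling sketch (hypothetical helper, not the paper's code).

    `feature` is a list of channels, each an H x W list of lists.
    Every channel is split into k*k downsampled channels, one per
    position (di, dj) inside a k x k window, which matches convolving
    with k*k one-hot binary kernels at stride k. The output channel
    ordering is an assumption.
    """
    pooled = []
    for channel in feature:
        h, w = len(channel), len(channel[0])
        if h % k or w % k:
            raise ValueError("H and W must be divisible by k")
        # One output channel per one-hot kernel position (di, dj).
        for di in range(k):
            for dj in range(k):
                pooled.append([row[dj::k] for row in channel[di::k]])
    return pooled


def max_pool(feature, k=2):
    """Plain max pooling for comparison: keeps one value per k x k window."""
    out = []
    for channel in feature:
        h, w = len(channel), len(channel[0])
        out.append([
            [max(channel[i + a][j + b] for a in range(k) for b in range(k))
             for j in range(0, w, k)]
            for i in range(0, h, k)
        ])
    return out


# A single-channel 4 x 4 feature map.
x = [[[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]]

pooled = index_pool(x)   # 4 channels of size 2 x 2
reduced = max_pool(x)    # 1 channel of size 2 x 2

# Index pooling keeps all 16 values; max pooling keeps only 4.
kept_by_index_pool = sorted(v for ch in pooled for row in ch for v in row)
kept_by_max_pool = sorted(v for ch in reduced for row in ch for v in row)
```

In this toy example, `pooled` holds four 2 x 2 channels that together contain every input value, whereas `reduced` retains only the maximum of each window. This is the sense in which index pooling reduces spatial resolution without discarding feature values.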