P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation

Qi Zhang; Guohua Geng; Longquan Yan; Pengbo Zhou; Zhaodi Li; Kang Li; Qinglin Liu

TL;DR

The paper tackles semantic segmentation of high-resolution remote sensing imagery by addressing multi-scale context and long-range dependencies within diffusion-based models. It introduces P-MSDiff, a parallel multi-scale diffusion framework built on a WNetFormer backbone and augmented with a plug-and-play Cross-Bridge Linear Attention module to enhance cross-scale feature integration. The authors demonstrate state-of-the-art or competitive performance on UAVid and Vaihingen datasets, notably improving small-target segmentation and boundary delineation while maintaining robust performance across classes. This approach advances diffusion-based segmentation for remote sensing by enabling efficient multi-scale denoising and effective attention, with practical implications for aerial mapping and urban analysis.

Abstract

Diffusion models and multi-scale features are essential components in semantic segmentation of remote sensing images: they improve segmentation boundaries and supply significant contextual information. U-Net-like architectures are frequently employed in diffusion models for segmentation. Their dense skip connections can make intermediate features hard to interpret, so semantic information may not be conveyed efficiently across the layers of the encoder-decoder. To address these challenges, we propose a new semantic segmentation model, a diffusion model with parallel multi-scale branches. The model consists of Parallel Multi-Scale Diffusion modules (P-MSDiff) and a Cross-Bridge Linear Attention mechanism (CBLA). P-MSDiff enhances the understanding of semantic information across multiple levels of granularity and detects repetitive distribution data through the integration of recursive denoising branches. It also fuses information by connecting the relevant branches to the primary framework, enabling concurrent denoising. Furthermore, within the interconnected transformer architecture, the Linear Attention (LA) module is replaced with the CBLA module, which integrates a semidefinite matrix linked to the query into the dot-product computation of keys and values. This enables the adaptation of queries within the LA framework and enhances the structure for multi-head attention computation, leading to improved network performance. CBLA is a plug-and-play module. Our model demonstrates superior performance based on the J1 metric on both the UAVid and Vaihingen Building datasets, showing improvements of 1.60% and 1.40% over strong baseline models, respectively.
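
To make the attention baseline mentioned in the abstract more concrete, the following is a minimal NumPy sketch of a softmax-normalized linear attention layer, the kind of LA module that CBLA is said to replace. It is an illustration under assumptions, not the authors' implementation: the function names (`softmax`, `linear_attention`, `quadratic_reference`) are hypothetical, the normalization axes follow a common linear attention formulation, and the query-linked semidefinite matrix that defines CBLA is not specified on this page, so it is deliberately left out.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def linear_attention(Q, K, V):
    """Softmax-normalized linear attention (assumed baseline form).

    Q, K: arrays of shape (N, d_k); V: array of shape (N, d_v).
    Returns an array of shape (N, d_v).
    """
    q = softmax(Q, axis=1)  # normalize each query over its d_k features
    k = softmax(K, axis=0)  # normalize each key feature over the N positions
    context = k.T @ V       # (d_k, d_v) summary, cost O(N * d_k * d_v)
    return q @ context      # never materializes an (N, N) attention map


def quadratic_reference(Q, K, V):
    """Same result computed through an explicit (N, N) attention map."""
    q = softmax(Q, axis=1)
    k = softmax(K, axis=0)
    return (q @ k.T) @ V


rng = np.random.default_rng(0)
N, d_k, d_v = 64, 8, 16
Q = rng.standard_normal((N, d_k))
K = rng.standard_normal((N, d_k))
V = rng.standard_normal((N, d_v))
out = linear_attention(Q, K, V)
```

The design point is associativity: multiplying keys and values first keeps the cost linear in the number of positions $N$, which matters for high-resolution remote sensing tiles. A CBLA-style variant would modify the key-value product with a query-dependent matrix; its exact form should be taken from the paper's Methods section.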

Paper Structure (30 sections, 22 equations, 6 figures, 3 tables)

Introduction
Related Work
Remote Sensing Image Semantic Segmentation
Diffusion Model
Multi-Branch Structure
Linear Attention (LA)
Methods
Overview
Background Of Semantic Segmentation Diffusion
Parallel Multi-Scale Diffusion (P-MSDiff)
Cross-Bridge Linear Attention (CBLA)
Loss Function
Experiments
Experimental setting
DATASETS
...and 15 more sections

Figures (6)

Figure 1: The overall framework of the P-MSDiff network. The network architecture comprises the WNetFormer core in the upper section and smaller parallel computational branches in the lower section. Feature fusion at consistent scales is denoted by black arrows. The Cross-Bridge Linear Attention (CBLA) mechanism is integrated into each module within the intermediary encoding and decoding structures to improve self-attention.
Figure 2: Training process of remote sensing image diffusion models. The central structure applies $T$ steps of noise diffusion to the ground truth, then uses the diffusion model for $T$ steps of denoising training.
Figure 3: A comparison of the LA module and the CBLA module. Each box represents an input, output, or intermediate matrix. The symbol $\rho$ denotes softmax normalization. $N$, $d$, $d_k$, and $d_v$ denote the size of the input and the dimensions of the input, keys, and values, respectively.
Figure 4: Visualization on the Vaihingen Buildings dataset. Our method produces more regular edges and results closer to the ground-truth labels when recognizing main buildings, giving more precise visual results than CNNs and other diffusion models.
Figure 5: The segmentation results on the UAVid dataset. The results of this approach are generally similar to those of RNDiff, showing consistent segmentation errors in large-scale classes. However, for small-scale objects such as Human, Moving Car, and Static Car, it produces more precise semantic labels.
...and 1 more figure
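
The forward half of the training process described in the Figure 2 caption (noise diffused onto the ground truth over $T$ steps) can be sketched as follows. This is a hedged illustration that assumes the standard DDPM closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with a linear noise schedule; the schedule values, the toy mask, and the names `linear_beta_schedule` and `q_sample` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def linear_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    """Per-step noise variances (assumed linear schedule)."""
    return np.linspace(beta_start, beta_end, T)


def q_sample(x0, t, alpha_bar, rng):
    """Jump directly to step t of the forward process.

    Returns the noised mask x_t and the Gaussian noise that was mixed in,
    which is the usual regression target during denoising training.
    """
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise


T = 1000
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention per step

rng = np.random.default_rng(0)
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                 # toy binary ground-truth mask
x0 = 2.0 * mask - 1.0                # rescale labels to [-1, 1]

t = T // 2
xt, noise = q_sample(x0, t, alpha_bar, rng)
```

During training, a denoising network would receive `xt`, the step index `t`, and the remote sensing image as conditioning, and would be optimized to recover `noise` (or `x0`); that network is outside the scope of this sketch.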
