Ultra-High Resolution Segmentation via Boundary-Enhanced Patch-Merging Transformer

Haopeng Sun, Yingwei Zhang, Lumin Xu, Sheng Jin, Yiqiang Chen

TL;DR

This work addresses the challenge of semantic segmentation for ultra-high-resolution (UHR) imagery by balancing global context and fine-grained details within a single-branch architecture. It introduces the Boundary-Enhanced Patch-Merging Transformer (BPT), combining the Patch-Merging Transformer (PMT) for dynamic token allocation with the Boundary-Enhanced Module (BEM) for training-time boundary refinement. The learning objective fuses semantic, boundary, and final supervision as $L_{Total} = \lambda_1 L_{Semantic} + \lambda_2 L_{Boundary} + \lambda_3 L_{Final}$, with $L_{Semantic} = \alpha_1 L_{FL} + \beta_1 L_{RL}$, $L_{Boundary} = \alpha_2 L_{DL} + \beta_2 L_{BCE}$, and $L_{Final} = \alpha_3 L_{FL} + \beta_3 L_{CE}$. Experiments on five public UHR benchmarks demonstrate state-of-the-art accuracy with no extra inference cost, validating the effectiveness of PMT in capturing both global context and local details and of BEM in refining boundaries.
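The weighted objective above can be sketched as plain arithmetic over already-computed loss values. This is a minimal illustration, not the authors' implementation: the class `LossWeights`, the function `total_loss`, and every default weight of 1.0 are hypothetical, since this summary does not state the weight values or expand the abbreviations FL, RL, DL, BCE, and CE. The six inputs are treated as opaque scalars, and $L_{FL}$ is passed twice because it appears in both $L_{Semantic}$ and $L_{Final}$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical defaults: the source does not give the actual weight values.
    lambda1: float = 1.0  # weight on L_Semantic
    lambda2: float = 1.0  # weight on L_Boundary
    lambda3: float = 1.0  # weight on L_Final
    alpha1: float = 1.0
    beta1: float = 1.0
    alpha2: float = 1.0
    beta2: float = 1.0
    alpha3: float = 1.0
    beta3: float = 1.0


def total_loss(fl_semantic, rl, dl, bce, fl_final, ce, w=LossWeights()):
    """Combine six precomputed scalar loss terms into L_Total."""
    # L_Semantic = alpha_1 * L_FL + beta_1 * L_RL
    l_semantic = w.alpha1 * fl_semantic + w.beta1 * rl
    # L_Boundary = alpha_2 * L_DL + beta_2 * L_BCE
    l_boundary = w.alpha2 * dl + w.beta2 * bce
    # L_Final = alpha_3 * L_FL + beta_3 * L_CE
    l_final = w.alpha3 * fl_final + w.beta3 * ce
    # L_Total = lambda_1 * L_Semantic + lambda_2 * L_Boundary + lambda_3 * L_Final
    return w.lambda1 * l_semantic + w.lambda2 * l_boundary + w.lambda3 * l_final
```

Because the boundary term is described as training-time supervision, setting `lambda2` to zero in this sketch simply drops that term from the sum; how the actual model handles the boundary branch at inference is not specified here.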

Abstract

Segmentation of ultra-high resolution (UHR) images is a critical task with numerous applications, yet it poses significant challenges due to high spatial resolution and rich fine details. Recent approaches adopt a dual-branch architecture, where a global branch learns long-range contextual information and a local branch captures fine details. However, they struggle to handle the conflict between global and local information while adding significant extra computational cost. Inspired by the human visual system's ability to rapidly orient attention to important areas with fine details and filter out irrelevant information, we propose a novel UHR segmentation method called Boundary-Enhanced Patch-Merging Transformer (BPT). BPT consists of two key components: (1) a Patch-Merging Transformer (PMT) that dynamically allocates tokens to informative regions to acquire global and local representations, and (2) a Boundary-Enhanced Module (BEM) that leverages boundary information to enrich fine details. Extensive experiments on multiple UHR image segmentation benchmarks demonstrate that our BPT outperforms previous state-of-the-art methods without introducing extra computational overhead. Code will be released to facilitate research.

Paper Structure

This paper contains 22 sections, 9 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: (a) Existing methods representing images as standard grids of pixels are sub-optimal for UHR segmentation. (b) Dual-branch framework preserves both global and local information at the cost of increased computation. (c) Our proposed model captures both global and local information by dynamically allocating tokens to informative regions (PMT) and leveraging boundary information (BEM).
  • Figure 2: (a) Overview of Boundary-Enhanced Patch-Merging Transformer (BPT), which consists of PMT and BEM. Dotted lines denote components that are needed only during the training phase. (b) Patch Recovering Block, (c) Patch Feature Extraction, (d) Boundary & Seg Head, (e) Patch Merging Block, (f) Feature Fusion Module.
  • Figure 3: Qualitative analysis on the DeepGlobe dataset. (a) Source image. (b) Patch tokens generated by PMT. (c) Ground-truth mask. (d) Results of GPWFormer. (e) Results of BPT (ours).