HiT: Building Mapping with Hierarchical Transformers

Mingming Zhang, Qingjie Liu, Yunhong Wang

TL;DR

HiT tackles the problem of polygonal building mapping from high-resolution remote sensing images by proposing a two-stage detector with a polygon head that predicts serialized vertices using a Transformer. A bidirectional polygon loss and a hierarchical vertex/edge attention encoder enable order-invariant vertex prediction and rich geometric encoding, allowing end-to-end training and improved instance and polygonal quality. Experiments on CrowdAI and Inria Polygonized show state-of-the-art performance across both instance segmentation and polygonal metrics, with robust results in complex scenes. The approach offers practical impact for accurate, scalable digital mapping and geospatial analysis by directly producing vector building footprints with high geometric fidelity.

Abstract

Deep learning-based methods have been extensively explored for automatic building mapping from high-resolution remote sensing images over recent years. While most building mapping models produce vector polygons of buildings for geographic and mapping systems, dominant methods typically decompose polygonal building extraction into several sub-problems, including segmentation, polygonization, and regularization, leading to complex inference procedures, low accuracy, and poor generalization. In this paper, we propose a simple and novel building mapping method with Hierarchical Transformers, called HiT, which improves polygonal building mapping quality from high-resolution remote sensing images. HiT builds on a two-stage detection architecture by adding a polygon head parallel to the classification and bounding box regression heads. HiT simultaneously outputs building bounding boxes and vector polygons and is fully end-to-end trainable. The polygon head formulates a building polygon as serialized vertices with a bidirectional characteristic, a simple and elegant polygon representation that avoids any start or end vertex hypothesis. Under this new perspective, the polygon head adopts a transformer encoder-decoder architecture to predict serialized vertices, supervised by the designed bidirectional polygon loss. Furthermore, a hierarchical attention mechanism combined with convolution operations is introduced in the encoder of the polygon head, providing more geometric structure of building polygons at the vertex and edge levels. Comprehensive experiments on two benchmarks (the CrowdAI and Inria datasets) demonstrate that our method sets a new state of the art on instance segmentation and polygonal metrics compared with existing methods. Moreover, qualitative results verify the effectiveness of our model in complex scenes.

Paper Structure (17 sections, 11 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Building mapping approaches, categorized into rasterized and polygonal mapping by output format. Rasterized mapping uses semantic or instance segmentation frameworks to obtain pixel-wise buildings, shown in (a) and (b). Polygonal mapping is subdivided into multi-stage and single-stage pipelines depending on whether buildings are segmented explicitly. Multi-stage mapping typically uses post-processing or multi-task learning to transform pixel-wise masks into polygonal buildings, shown in (c) and (d). Single-stage mapping uses serialized-vertex or vertex-connection prediction modules to obtain building vertices directly, shown in (e) and (f).
  • Figure 2: Overview of HiT. HiT is a two-stage building mapping framework, which includes classification, bounding box regression, and polygon heads. The polygon head predicts serialized vertices of a building, together with building detection. We introduce a novel bidirectional polygon loss to train the polygon head without complex constraints.
  • Figure 3: Illustration of the polygon head. The encoder with a hierarchical attention mechanism embeds more geometric information into the building feature. The decoder learns vertex queries to predict serialized vertices.
  • Figure 4: Illustration of vertex-level and edge-level attention operations. Vertex-level and edge-level attention replace the original self-attention mechanism when encoding the building feature map, reducing complexity and speeding up convergence by introducing geometric information at the vertex and edge levels.
  • Figure 5: Illustration of the serialized vertices prediction loss $L_{sv}$. (a) Search for the corresponding vertex. (b) Shift the predicted serialized vertices. (c) Reverse the predicted serialized vertices.
  • ...and 5 more figures
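
The shift and reverse steps named in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' loss: the exact definition of $L_{sv}$ is not given on this page. The sketch assumes that the predicted and target polygons have the same number of vertices, that a mean L1 distance is used, and that a brute-force minimum over all cyclic shifts and both traversal directions stands in for the "search the corresponding vertex" step. The names `l1_distance` and `bidirectional_polygon_loss` are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def l1_distance(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean L1 distance between two equally long vertex sequences."""
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / len(target)


def bidirectional_polygon_loss(pred: Sequence[Point],
                               target: Sequence[Point]) -> float:
    """Minimum mean L1 distance over every cyclic shift and both directions.

    Assumption (not from the paper): equal vertex counts and brute-force
    matching instead of an explicit corresponding-vertex search.
    """
    if not target or len(pred) != len(target):
        raise ValueError("sequences must be non-empty and equally long")
    forward = list(pred)
    backward = forward[::-1]          # reversed traversal direction
    best = float("inf")
    for seq in (forward, backward):
        for shift in range(len(seq)):
            shifted = seq[shift:] + seq[:shift]   # change the start vertex
            best = min(best, l1_distance(shifted, target))
    return best


# A unit square, and the same square traversed in the opposite direction
# from a different start vertex.
target = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
pred = [(1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

naive = l1_distance(pred, target)                   # penalizes the ordering
matched = bidirectional_polygon_loss(pred, target)  # ignores the ordering
```

In this example the naive position-by-position distance is nonzero even though both sequences describe the same square, while the shift-and-reverse minimum is zero. That is the property the abstract describes as avoiding a start or end vertex hypothesis.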