The P$^3$ dataset: Pixels, Points and Polygons for Multimodal Building Vectorization

Raphael Sulzer, Liuyun Duan, Nicolas Girard, Florent Lafarge

TL;DR

The paper introduces the P$^3$ dataset, a large-scale multimodal benchmark for building vectorization that jointly leverages high-resolution aerial imagery and aerial LiDAR point clouds to predict 2D building footprints. It provides data from the USA, Switzerland, and New Zealand across 638 km$^2$, containing ~10$^{10}$ LiDAR points and 25 cm RGB imagery, along with ground-truth 2D polygons in MS-COCO format. The authors benchmark three state-of-the-art vectorization methods (FFL, HiSup, Pix2Poly) in image-only, LiDAR-only, and multimodal fusion settings, and introduce comprehensive metrics including POLIS, HD, CD, and MTA beyond IoU-based measures. The study demonstrates that LiDAR improves polygon prediction, that fusing image and LiDAR yields the best results, and that the dataset is sufficiently challenging to motivate multimodal approaches; the data and pretrained models are publicly available for broader evaluation and reuse. The work highlights practical implications for scalable, accurate cadastral mapping and points to future work toward broader geographic coverage and richer annotations.
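Two of the metrics named above, the Hausdorff distance (HD) and the Chamfer distance (CD), can be illustrated on polygon vertex sets. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: it uses the symmetric Hausdorff distance and the sum-of-means Chamfer convention on raw vertices, whereas the benchmark may sample points along polygon boundaries or normalize differently. The function names and the toy polygons are hypothetical.

```python
import math


def _nearest(p, pts):
    """Distance from point p to its nearest neighbour in pts."""
    return min(math.dist(p, q) for q in pts)


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance: the worst nearest-neighbour distance."""
    return max(
        max(_nearest(p, b) for p in a),
        max(_nearest(q, a) for q in b),
    )


def chamfer_distance(a, b):
    """Chamfer distance (assumed convention: sum of the two directed means)."""
    return (
        sum(_nearest(p, b) for p in a) / len(a)
        + sum(_nearest(q, a) for q in b) / len(b)
    )


# Toy example: the prediction misplaces one corner by 1 unit.
gt = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
pred = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 4.0)]

hd = hausdorff_distance(gt, pred)
cd = chamfer_distance(gt, pred)
```

HD reports only the single worst deviation, while CD averages over all vertices, so the two respond differently to one misplaced corner versus a uniformly shifted outline.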

Abstract

We present the P$^3$ dataset, a large-scale multimodal benchmark for building vectorization, constructed from aerial LiDAR point clouds, high-resolution aerial imagery, and vectorized 2D building outlines, collected across three continents. The dataset contains over 10 billion LiDAR points with decimeter-level accuracy and RGB images at a ground sampling distance of 25 centimeters. While many existing datasets primarily focus on the image modality, P$^3$ offers a complementary perspective by also incorporating dense 3D information. We demonstrate that LiDAR point clouds serve as a robust modality for predicting building polygons, both in hybrid and end-to-end learning frameworks. Moreover, fusing aerial LiDAR and imagery further improves the accuracy and geometric quality of predicted polygons. The P$^3$ dataset is publicly available, along with code and pretrained weights of three state-of-the-art models for building polygon prediction, at https://github.com/raphaelsulzer/PixelsPointsPolygons .

Paper Structure

This paper contains 43 sections, 15 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Overview of P$^3$. We collect aerial images, aerial LiDAR point clouds and vectorized building outlines from the USA, Switzerland and New Zealand. We harmonize and tile the data to create a large-scale benchmark dataset for building vectorization.
  • Figure 2: Example image/LiDAR tiles with polygon annotations. In (a), we show an example tile of Jersey, NY, US. The image exhibits distortion, which leads to the "leaning building" effect. The polygon annotation is nonetheless accurately placed on the base of the building. The LiDAR acquisition does not suffer from the "leaning building" effect. In (b), we show an example tile of Zurich, Switzerland. The depicted building has two interior cutouts. The polygon annotation thus has two interior rings, each connected to the exterior ring at its closest vertex.
  • Figure 3: Modality ablation. We present a sample tile displaying predicted and ground truth building polygons from the Switzerland subset. The first column shows ground truth polygons, while subsequent columns show predicted polygons from baseline models trained on different input modalities, i.e. images only (first row), LiDAR only (second row), and the fusion of image and LiDAR data (third row). Note that in the bottom right corner, a tree obscuring a building corner adversely affects the image-only prediction, while the LiDAR-only and multimodal polygon predictions remain unaffected by this occlusion. Across all models, polygons predicted using multimodal inputs demonstrate superior simplicity and accuracy, especially with Pix2Poly.
  • Figure 4: Multimodal polygon prediction. We show ground truth and predicted building outlines from our full dataset, from top to bottom for Switzerland, New Zealand and the USA, with input LiDAR point clouds superimposed on aerial images. The first column shows the ground truth reference polygons, while the second to fourth columns present predicted polygons generated by FFL, HiSup and Pix2Poly utilizing both input modalities.
  • Figure A.1: Ground sampling distance ablation. We present a sample tile displaying predicted and ground truth building polygons from the Switzerland subset with different point densities. The models are trained on images with a GSD of 15 cm and 25 cm, which leads to different input tiles and predictions.
  • ...and 6 more figures
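
The Figure 2 caption describes interior rings being connected to the exterior ring at their closest vertex, which lets a polygon with holes be stored as one flat coordinate list in the MS-COCO style. The sketch below shows one way such a bridge could be built; it is an illustration under assumptions, not the dataset's actual conversion code. The function names, the open-ring representation (no repeated closing vertex), and the opposite orientation of the hole are all hypothetical choices.

```python
import math


def bridge_hole(exterior, hole):
    """Merge one interior ring into the exterior ring via the closest vertex pair.

    Both rings are lists of (x, y) tuples without a repeated closing vertex.
    """
    # Find the exterior/interior vertex pair with the smallest distance.
    i, j = min(
        ((i, j) for i in range(len(exterior)) for j in range(len(hole))),
        key=lambda ij: math.dist(exterior[ij[0]], hole[ij[1]]),
    )
    # Walk the exterior up to vertex i, traverse the whole hole starting and
    # ending at vertex j, return to exterior vertex i, then finish the exterior.
    return exterior[: i + 1] + hole[j:] + hole[: j + 1] + exterior[i:]


def to_coco_segmentation(ring):
    """Flatten [(x, y), ...] into the COCO-style [x1, y1, x2, y2, ...] list."""
    return [coord for point in ring for coord in point]


def signed_area(ring):
    """Shoelace formula; the two bridge edges cancel each other out."""
    n = len(ring)
    return 0.5 * sum(
        ring[k][0] * ring[(k + 1) % n][1] - ring[(k + 1) % n][0] * ring[k][1]
        for k in range(n)
    )


# Toy example: a 10 x 10 square (counter-clockwise) with a 2 x 2 hole (clockwise).
exterior = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole = [(2, 2), (2, 4), (4, 4), (4, 2)]

merged = bridge_hole(exterior, hole)
segmentation = to_coco_segmentation(merged)
```

Because the hole is traversed in the opposite orientation to the exterior, the shoelace area of the merged ring equals the exterior area minus the hole area, so area-based measures still see the cutout.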