PureForest: A Large-Scale Aerial Lidar and Aerial Imagery Dataset for Tree Species Classification in Monospecific Forests
Charles Gaydon, Floryne Roche
TL;DR
PureForest introduces the largest open, multimodal benchmark for tree species classification in monospecific forests, combining high-density aerial Lidar and Very High Resolution imagery across 449 forests and 339 square kilometers. The dataset covers 18 tree species grouped into 13 semantic classes and provides carefully curated polygon-based annotations, train/val/test splits, and 50 m by 50 m patches to enable robust evaluation. In baseline experiments, Lidar-based methods reach an overall accuracy (OA) of around 80% and a mean intersection-over-union (mIoU) of around 55%. Colorizing the Lidar points offers limited gains, and adding elevation context brings modest improvements. Aerial imagery baselines are competitive but generally fall behind Lidar in this setup. By releasing the dataset and baseline code, PureForest aims to advance deep learning for forest mapping, encourage multimodal fusion, and support reproducible, large-scale research in forest management and ecology.
Abstract
Knowledge of tree species distribution is fundamental to managing forests. New deep learning approaches promise significant accuracy gains for forest mapping and are becoming a critical tool for mapping multiple tree species at scale. To advance the field, deep learning researchers need large benchmark datasets with high-quality annotations. To this end, we present the PureForest dataset: a large-scale, open, multimodal dataset designed for tree species classification from both Aerial Lidar Scanning (ALS) point clouds and Very High Resolution (VHR) aerial images. Most current public Lidar datasets for tree species classification have low diversity, as they span only a few dozen annotated hectares at most. In contrast, PureForest has 18 tree species grouped into 13 semantic classes and spans 339 km$^2$ across 449 distinct monospecific forests. It is to date the largest and most comprehensive Lidar dataset for the identification of tree species. By making PureForest publicly available, we hope to provide a challenging benchmark dataset to support the development of deep learning approaches for tree species identification from Lidar and/or aerial imagery. In this data paper, we describe the annotation workflow, the dataset, and the recommended evaluation methodology, and we establish a baseline performance from both 3D and 2D modalities.
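The baseline results summarized above are reported as overall accuracy (OA) and mean intersection-over-union (mIoU). As a minimal sketch of how these two metrics are commonly derived from a confusion matrix, the Python snippet below uses hypothetical helper names (`confusion_matrix`, `overall_accuracy`, `mean_iou`) and toy labels. It is not the released baseline code and the numbers are not PureForest results. OA weights every sample equally while mIoU weights every class equally, so the two can diverge when class frequencies differ.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix: rows are reference classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    """Fraction of all samples that fall on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def mean_iou(cm):
    """Average of per-class TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # reference c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, reference otherwise
        denom = tp + fp + fn
        if denom:  # skip classes absent from both reference and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy, imbalanced example with three hypothetical classes (not real data).
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 0, 1]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(round(overall_accuracy(cm), 3))  # 0.714
print(round(mean_iou(cm), 3))          # 0.378
```

In this toy case the frequent class is predicted well, which lifts OA, while the rare classes pull mIoU down.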
