VibrantVS: A high-resolution multi-task transformer for forest canopy height estimation

Tony Chang, Kiarie Ndegwa, Andreas Gros, Vincent A. Landau, Luke J. Zachmann, Bogdan State, Mitchell A. Gritts, Colton W. Miller, Nathan E. Rutenbeck, Scott Conway, Guy Bayes

TL;DR

This work addresses the need for up-to-date, high-resolution forest canopy information to support wildfire risk mitigation and ecological management. It introduces VibrantVS, a high-resolution multi-task Vision Transformer trained on 4-band NAIP imagery to estimate canopy height models (CHMs) and canopy cover, and benchmarks it against three baselines (Meta, LANDFIRE, ETH) across 24 EPA Level 3 ecoregions in the western United States. VibrantVS is both more accurate and more precise than the baselines, with a median $MAE$ of 2.71 m compared to 4.83–7.05 m, and it produces CHMs at 0.5 m resolution that can be updated every three years or less. The model combines large, diverse training data, high-resolution inputs, and architectural enhancements to support downstream forest structure analyses (e.g., TAO segmentation) and wildfire risk modeling at high spatial fidelity, with practical implications for ecological monitoring and land management.
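To make the headline metric concrete, the sketch below computes a mean absolute error per tile, $MAE = \frac{1}{n}\sum_{i=1}^{n} |\hat{h}_i - h_i|$, and then takes the median of those tile-level values within each group (for example, per ecoregion). It is only an illustration under stated assumptions: the function names, the flat lists of heights, and the toy values are hypothetical, and the paper's actual evaluation pipeline (pixel alignment, masking, resolution handling) is not described in this summary.

```python
from statistics import median


def tile_mae(pred, ref):
    """Mean absolute error (in metres) between predicted and reference
    canopy heights for one tile. Both inputs are equal-length sequences."""
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("pred and ref must be non-empty and the same length")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


def median_mae_by_group(tiles):
    """Group tile-level MAE values by a label (e.g. ecoregion) and return
    the median MAE for each group.

    tiles: iterable of (group_label, predicted_heights, reference_heights)
    """
    per_group = {}
    for group, pred, ref in tiles:
        per_group.setdefault(group, []).append(tile_mae(pred, ref))
    return {group: median(values) for group, values in per_group.items()}


# Toy, made-up heights (metres); these are not values from the paper.
tiles = [
    ("ecoregion_a", [10.0, 12.0], [11.0, 15.0]),
    ("ecoregion_a", [5.0, 5.0], [6.0, 5.0]),
    ("ecoregion_b", [20.0, 0.0], [14.0, 0.0]),
]

summary = median_mae_by_group(tiles)
for group, value in sorted(summary.items()):
    print(f"{group}: median tile MAE = {value:.2f} m")
```

Taking the median over tiles, rather than pooling all pixels, keeps a few very large or very poorly predicted tiles from dominating the summary; whether the paper aggregates in exactly this way is an assumption here.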

Abstract

This paper explores the application of a novel multi-task vision transformer (ViT) model for the estimation of canopy height models (CHMs) using 4-band National Agriculture Imagery Program (NAIP) imagery across the western United States. We compare the accuracy and precision of this model, aggregated across ecoregions and height classes, against three peer-reviewed benchmark models. Key findings suggest that, while the benchmark models can provide high precision in localized areas, the VibrantVS model has substantial advantages across a broad range of ecoregions in the western United States: higher accuracy, higher precision, the ability to generate updated inference at a cadence of three years or less, and high spatial resolution. The VibrantVS model provides significant value for ecological monitoring and land management decisions, including wildfire mitigation.

Paper Structure

This paper contains 20 sections, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Sampling of tiles within Hydrologic Units 12 (HUC12) watersheds of the western United States that covers 24 EPA L3 ecoregions containing sufficient quality 3DEP lidar data and spatially/temporally intersecting NAIP data. Regions were additionally selected within WCS areas to optimize for model evaluation in regions where wildfire risk mitigation is a priority.
  • Figure 2: Sample tile counts within each of the randomly sampled train and test (approx. 85% to 15% ratio) groups by EPA L3 ecoregion.
  • Figure 3: Map of EPA L3 Ecoregions into which our test tiles were aggregated to evaluate baseline model performance.
  • Figure 4: VibrantVS multi-task vision transformer architecture with 4-band NAIP input to predict CHM and canopy cover (CC).
  • Figure 5: Box and whisker plots of test tile level MAE by model and ecoregion sorted by lowest to highest median MAE across all models.
  • ...and 10 more figures