M2fNet: Multi-modal Forest Monitoring Network on Large-scale Virtual Dataset

Yawen Lu, Yunhan Huang, Su Sun, Tansi Zhang, Xuewen Zhang, Songlin Fei, Yingjie Chen

TL;DR

This work tackles the domain gap in real-world forest monitoring by introducing a photorealistic virtual forest dataset and a multimodal transformer-based detector, M2fNet, that jointly processes RGB and depth data for instance-level tree detection and segmentation. The architecture uses dual Swin-transformer encoders, a fusion module, a high-resolution pixel decoder, and a Transformer decoder with tree queries to produce precise masks and boxes, optimized with BCE, Dice, and smooth-L1 losses. Empirical results show that synthetic data pre-training substantially improves performance on real forest images, with M2fNet outperforming RGB-only baselines; a domain-expert user study supports the realism and educational value of the simulation pipeline. The work contributes a scalable simulation-to-reality framework that can enhance forestry education, mapping, DBH estimation, and long-term tree tracking, facilitating safer, cheaper, and more extensive forestry analysis and training.
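The loss combination named above (BCE and Dice on masks, smooth-L1 on boxes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the equal loss weights, the `smooth` and `beta` constants, and the flat-list representation of masks and boxes are all assumptions made for illustration only.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened mask probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Dice loss: 1 minus the smoothed overlap ratio of prediction and target."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)

def smooth_l1_loss(pred_box, gt_box, beta=1.0):
    """Mean smooth-L1: quadratic below beta, linear above it."""
    total = 0.0
    for p, g in zip(pred_box, gt_box):
        d = abs(p - g)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred_box)

def combined_loss(probs, targets, pred_box, gt_box,
                  w_bce=1.0, w_dice=1.0, w_box=1.0):
    """Weighted sum of the three terms; the weights here are hypothetical."""
    return (w_bce * bce_loss(probs, targets)
            + w_dice * dice_loss(probs, targets)
            + w_box * smooth_l1_loss(pred_box, gt_box))

if __name__ == "__main__":
    mask_pred = [0.9, 0.1, 0.8, 0.2]   # predicted per-pixel probabilities
    mask_gt = [1, 0, 1, 0]             # ground-truth binary mask
    box_pred = [10.0, 20.0, 50.0, 80.0]
    box_gt = [12.0, 20.5, 50.0, 79.0]
    print(combined_loss(mask_pred, mask_gt, box_pred, box_gt))
```

In practice a deep learning framework's built-in losses would replace these loops; the sketch only shows how the three terms differ and how they add up to a single training objective.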

Abstract

Forest monitoring and education are key to forest protection and management, and they offer an effective way to measure a country's progress on its forest and climate commitments. Because no large-scale wild forest monitoring benchmark exists, the common practice is to train a model on a general outdoor benchmark (e.g., KITTI) and evaluate it on real forest datasets (e.g., CanaTree100). The large domain gap in this setting makes evaluation and deployment difficult. In this paper, we propose a new photorealistic virtual forest dataset and a multimodal transformer-based algorithm for tree detection and instance segmentation. To the best of our knowledge, this is the first time a multimodal detection and segmentation algorithm has been applied to large-scale forest scenes. We believe that the proposed dataset and method will inspire the simulation, computer vision, education, and forestry communities towards a more comprehensive multi-modal understanding.

Paper Structure (13 sections, 1 equation, 6 figures, 1 table)

This paper contains 13 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of the application case using the proposed multimodal forest monitoring network with our newly simulated large-scale forest dataset. (a) Pipeline diagram showing how our application generates virtual forest scenes and renderings for DL algorithm training. (b) The multi-modal forest network (M2fNet) takes multiple modalities (RGB and depth) as input and learns instance-level boxes and masks as output. The proposed network and dataset can be used in forest applications including forest surveying and tracking, tree diameter measurement, and tree localization.
  • Figure 2: Examples of our simulated data. Top to bottom: Rendered RGB image, masked image, scene depth, and LiDAR point cloud.
  • Figure 3: Project setup for automatic data generation. RGB, depth and semantic cameras are used simultaneously to render forest scenes.
  • Figure 4: Three-stage data generation pipeline using Unreal Engine, including tree model preparation, scene generation and data rendering.
  • Figure 5: We design a BP_Tree class for randomly selecting tree species and models from our collected library.
  • ...and 1 more figures