M2fNet: Multi-modal Forest Monitoring Network on Large-scale Virtual Dataset
Yawen Lu, Yunhan Huang, Su Sun, Tansi Zhang, Xuewen Zhang, Songlin Fei, Yingjie Chen
TL;DR
This work tackles the domain gap in real-world forest monitoring by introducing a photorealistic virtual forest dataset and a multimodal transformer-based detector, M2fNet, that jointly processes RGB and depth data for instance-level tree detection and segmentation. The architecture uses dual Swin Transformer encoders, a fusion module, a high-resolution pixel decoder, and a transformer decoder with tree queries to produce precise masks and boxes, and it is optimized with BCE, Dice, and smooth-L1 losses. Empirical results show that pre-training on synthetic data substantially improves performance on real forest images, with M2fNet outperforming RGB-only baselines; a user study with domain experts supports the realism and educational value of the simulation pipeline. The work contributes a scalable simulation-to-reality framework that can enhance forestry education, mapping, DBH estimation, and long-term tree tracking, facilitating safer, cheaper, and more extensive forestry analysis and training.
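To make the training objective named above more concrete, the following is a minimal sketch of how BCE, Dice, and smooth-L1 terms could be combined for one predicted tree instance. It is an illustration under stated assumptions, not the authors' implementation: the function names, the loss weights (`w_bce`, `w_dice`, `w_box`), the smoothing constants, and the flat-list mask representation are all hypothetical, and the prediction is assumed to be already paired with its ground-truth tree.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over per-pixel mask probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """One minus the (smoothed) Dice overlap between mask and target."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def smooth_l1_loss(pred_box, gt_box, beta=1.0):
    """Mean smooth-L1 (Huber-style) loss over box coordinates."""
    total = 0.0
    for p, g in zip(pred_box, gt_box):
        d = abs(p - g)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred_box)


def instance_loss(mask_probs, mask_gt, pred_box, gt_box,
                  w_bce=1.0, w_dice=1.0, w_box=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (w_bce * bce_loss(mask_probs, mask_gt)
            + w_dice * dice_loss(mask_probs, mask_gt)
            + w_box * smooth_l1_loss(pred_box, gt_box))


if __name__ == "__main__":
    # Toy 4-pixel mask and a 4-coordinate box, purely for illustration.
    mask_probs = [0.9, 0.2, 0.8, 0.1]
    mask_gt = [1, 0, 1, 0]
    print(instance_loss(mask_probs, mask_gt,
                        [0.1, 0.1, 0.5, 0.5], [0.0, 0.0, 0.5, 0.5]))
```

In this sketch the BCE and Dice terms both act on the mask, with BCE scoring each pixel independently and Dice scoring region overlap, while smooth-L1 acts on the box coordinates. How the terms are actually weighted and how predictions are matched to ground-truth trees are not stated here and would need to be taken from the full paper.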
Abstract
Forest monitoring and education are key to forest protection and management, and they offer an effective way to measure progress on a country's forest and climate commitments. Because no large-scale benchmark for wild forest monitoring exists, the common practice is to train a model on a general outdoor benchmark (e.g., KITTI) and evaluate it on real forest datasets (e.g., CanaTree100). However, the large domain gap in this setting makes evaluation and deployment difficult. In this paper, we propose a new photorealistic virtual forest dataset and a multimodal transformer-based algorithm for tree detection and instance segmentation. To the best of our knowledge, this is the first time a multimodal detection and segmentation algorithm has been applied to large-scale forest scenes. We believe that the proposed dataset and method will inspire the simulation, computer vision, education, and forestry communities towards a more comprehensive multimodal understanding of forest scenes.
