
Multimodal Urban Tree Detection from Satellite and Street-Level Imagery via Annotation-Efficient Deep Learning Strategies

In Seon Kim, Ali Moghimi

Abstract

Beyond the immediate biophysical benefits, urban trees play a foundational role in environmental sustainability and disaster mitigation. Precise mapping of urban trees is essential for environmental monitoring, post-disaster assessment, and strengthening policy. However, the transition from traditional, labor-intensive field surveys to scalable automated systems remains limited by high annotation costs and poor generalization across diverse urban scenarios. This study introduces a multimodal framework that integrates high-resolution satellite imagery with ground-level Google Street View to enable scalable and detailed urban tree detection under limited-annotation conditions. The framework first leverages satellite imagery to localize tree candidates and then retrieves targeted ground-level views for detailed detection, significantly reducing inefficient street-level sampling. To address the annotation bottleneck, domain adaptation is used to transfer knowledge from an existing annotated dataset to a new region of interest. To further minimize human effort, we evaluated three learning strategies: semi-supervised learning, active learning, and a hybrid approach combining both, using a transformer-based detection model. The hybrid strategy achieved the best performance with an F1-score of 0.90, representing a 12% improvement over the baseline model. In contrast, semi-supervised learning exhibited progressive performance degradation due to confirmation bias in pseudo-labeling, while active learning steadily improved results through targeted human intervention to label uncertain or incorrect predictions. Error analysis further showed that active and hybrid strategies reduced both false positives and false negatives. Our findings highlight the importance of a multimodal approach and guided annotation for scalable, annotation-efficient urban tree mapping to strengthen sustainable city planning.


Paper Structure

This paper contains 38 sections, 1 equation, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Illustration of the bearing angle calculation. The bearing angle $\theta$ from the panorama vehicle position $(p_e, p_n)$ to the tree target $(t_e, t_n)$ is computed in UTM coordinates, and passed as the bearing parameter to the Street View Static API to center the camera on the target tree.
  • Figure 2: Flowcharts for the three annotation-efficient learning strategies. (A) Flowchart for the semi-supervised pipeline. The model is trained on the labeled train/validation dataset and deployed on an unlabeled pool. Detections with confidence above 0.8 are automatically accepted as pseudo-labels and merged with the original dataset for retraining, while lower-confidence predictions are placed back in the unlabeled pool. (B) Active learning and hybrid learning strategies. In active learning, samples with prediction confidence below 0.5 are selected for human annotation. In the hybrid approach, predictions with confidence above 0.8 are automatically accepted as pseudo-labels (semi-supervised learning), while samples with confidence below 0.5 are assigned for human annotation (active learning) before merging and deduplication.
  • Figure 3: Precision, Recall, and F1-Score curve of the satellite model canopy detection. The best F1-score of 0.78 is achieved at a confidence threshold of 0.41. Lower thresholds increase recall at the cost of lower precision, while higher thresholds improve precision but reduce recall.
  • Figure 4: Performance comparison of three learning strategies over 10 rounds: Semi-Supervised learning (SS), Active Learning (AL), and hybrid AL + Semi-Supervised learning. (A) Precision trajectories across training rounds. (B) Recall trajectories across training rounds. (C) F1 score trajectories with peak performance markers (stars) for each method.
  • Figure 5: Error analysis and performance metric dynamics across the three learning strategies over iterative training rounds, including true positives (A), false positives (B), and false negatives (C).
  • ...and 2 more figures
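The bearing computation described in Figure 1 can be sketched as a few lines of Python. This is a minimal illustration, not the authors' code: the function name `bearing_deg` is ours, and we assume the standard planar bearing formula $\theta = \operatorname{atan2}(t_e - p_e,\, t_n - p_n)$ over UTM easting/northing coordinates, with the result normalized to degrees clockwise from north as the Street View Static API's bearing parameter expects.

```python
import math

def bearing_deg(p_e: float, p_n: float, t_e: float, t_n: float) -> float:
    """Compass bearing (degrees clockwise from north) from the panorama
    vehicle position (p_e, p_n) to the tree target (t_e, t_n), with both
    points given as UTM easting/northing in meters."""
    # atan2(east_offset, north_offset) yields the angle measured from
    # north toward east, i.e., a compass bearing in radians.
    theta = math.degrees(math.atan2(t_e - p_e, t_n - p_n))
    return theta % 360.0  # normalize to [0, 360)
```

For example, a tree due east of the vehicle yields a bearing of 90°, and a tree due west yields 270°.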
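The confidence-based routing in the hybrid strategy of Figure 2 can likewise be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function `triage` and its dictionary input format are our assumptions, and we assume (consistent with Figure 2A) that mid-confidence detections return to the unlabeled pool for a later round.

```python
def triage(predictions, pseudo_thresh=0.8, query_thresh=0.5):
    """Split detector outputs by confidence score, following the hybrid
    strategy: detections above pseudo_thresh are accepted as pseudo-labels
    (semi-supervised), those below query_thresh are routed to a human
    annotator (active learning), and the rest stay in the unlabeled pool."""
    pseudo, human, pool = [], [], []
    for pred in predictions:
        if pred["score"] >= pseudo_thresh:
            pseudo.append(pred)       # auto-accepted pseudo-label
        elif pred["score"] < query_thresh:
            human.append(pred)        # queried for human annotation
        else:
            pool.append(pred)         # returned to the unlabeled pool
    return pseudo, human, pool
```

Setting `query_thresh` to 0 recovers the pure semi-supervised pipeline, while setting `pseudo_thresh` above 1 recovers pure active learning, so the same routine covers all three strategies compared in Figure 4.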