TSE-Net: Semi-supervised Monocular Height Estimation from Single Remote Sensing Images
Sining Chen, Xiao Xiang Zhu
TL;DR
This work tackles monocular height estimation from single remote sensing images when labeled data are scarce. It introduces TSE-Net, a semi-supervised self-training framework with a Teacher–Student–Exam architecture. A multi-task teacher (regression and classification, with Hierarchical Bi-Cut bins and Plackett–Luce calibration) generates pseudo-labels that are filtered by a ranking mechanism, and a student–exam pair learns from unlabeled data with exponential moving average (EMA) stabilization. The method achieves notable improvements across three datasets, especially at very low label ratios (e.g., 0.1%), and yields more balanced building height predictions by mitigating long-tailed biases. These results demonstrate the viability of semi-supervised regression in remote sensing and enable scalable 3D building modeling with limited annotations.
Abstract
Monocular height estimation plays a critical role in 3D perception for remote sensing, offering a cost-effective alternative to multi-view or LiDAR-based methods. While deep learning has significantly advanced monocular height estimation, these methods remain fundamentally limited by the availability of labeled data, which are expensive and labor-intensive to obtain at scale. The scarcity of high-quality annotations hinders the generalization and performance of existing models. To overcome this limitation, we propose leveraging large volumes of unlabeled data through a semi-supervised learning framework, enabling the model to extract informative cues from unlabeled samples and improve its predictive performance. In this work, we introduce TSE-Net, a self-training pipeline for semi-supervised monocular height estimation. The pipeline integrates teacher, student, and exam networks. The student network is trained on unlabeled data using pseudo-labels generated by the teacher network, while the exam network functions as a temporal ensemble of the student network to stabilize performance. The teacher network is formulated as a joint regression and classification model: the regression branch predicts height values that serve as pseudo-labels, and the classification branch predicts height value classes along with class probabilities, which are used to filter pseudo-labels. Height value classes are defined using a hierarchical bi-cut strategy to address the inherent long-tailed distribution of heights, and the predicted class probabilities are calibrated with a Plackett-Luce model to reflect the expected accuracy of pseudo-labels. We evaluate the proposed pipeline on three datasets spanning different resolutions and imaging modalities. Code is available at https://github.com/zhu-xlab/tse-net.
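The interplay of the three networks described above can be illustrated with a minimal sketch in plain Python. It is not the released implementation: the function names, the decay value, the fixed confidence threshold, and the L1 loss are all assumptions chosen for illustration, and a simple threshold stands in for the ranking-based filter used in the paper. Weights and per-pixel predictions are represented as flat lists of floats.

```python
def ema_update(exam_weights, student_weights, decay=0.99):
    """Exam network as a temporal ensemble: blend each exam weight
    with the current student weight (decay is an assumed value)."""
    return [decay * e + (1.0 - decay) * s
            for e, s in zip(exam_weights, student_weights)]


def filter_pseudo_labels(heights, confidences, threshold=0.8):
    """Keep (pixel index, teacher height) pairs whose calibrated class
    probability reaches the threshold (a hypothetical stand-in for the
    paper's ranking-based filter)."""
    return [(i, h)
            for i, (h, c) in enumerate(zip(heights, confidences))
            if c >= threshold]


def masked_l1_loss(student_pred, kept):
    """Mean absolute error of the student against the kept pseudo-labels
    only; returns 0.0 when no pseudo-label survives filtering."""
    if not kept:
        return 0.0
    return sum(abs(student_pred[i] - h) for i, h in kept) / len(kept)


# Toy example: three pixels, teacher heights in meters with confidences.
teacher_heights = [3.0, 10.0, 25.0]
teacher_conf = [0.9, 0.4, 0.85]
kept = filter_pseudo_labels(teacher_heights, teacher_conf, threshold=0.8)

student_pred = [2.0, 0.0, 27.0]
loss = masked_l1_loss(student_pred, kept)

# After a (not shown) gradient step on the student, refresh the exam weights.
exam = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.5)
```

In this toy run the low-confidence middle pixel is excluded from the loss, so an unreliable pseudo-label does not influence the student, while the exam weights move only partway toward the student at each step.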
