TSE-Net: Semi-supervised Monocular Height Estimation from Single Remote Sensing Images

Sining Chen, Xiao Xiang Zhu

TL;DR

This work tackles monocular height estimation from single remote sensing images when labeled data are scarce. It introduces TSE-Net, a semi-supervised self-training framework with a Teacher–Student–Exam architecture. A multi-task teacher (regression plus classification, with Hierarchical Bi-Cut bins and Plackett–Luce calibration) generates pseudo-labels that are filtered by a ranking mechanism, and a student–exam pair learns from unlabeled data, with the exam network stabilized as an exponential moving average (EMA) of the student. The method improves results on three datasets, most clearly at very low label ratios (e.g., 0.1%), and yields more balanced building height predictions by mitigating long-tailed biases. These results support the viability of semi-supervised regression in remote sensing and point toward scalable 3D building modeling with limited annotations.

Abstract

Monocular height estimation plays a critical role in 3D perception for remote sensing, offering a cost-effective alternative to multi-view or LiDAR-based methods. While deep learning has significantly advanced the capabilities of monocular height estimation, these methods remain fundamentally limited by the availability of labeled data, which are expensive and labor-intensive to obtain at scale. The scarcity of high-quality annotations hinders the generalization and performance of existing models. To overcome this limitation, we propose leveraging large volumes of unlabeled data through a semi-supervised learning framework, enabling the model to extract informative cues from unlabeled samples and improve its predictive performance. In this work, we introduce TSE-Net, a self-training pipeline for semi-supervised monocular height estimation. The pipeline integrates teacher, student, and exam networks. The student network is trained on unlabeled data using pseudo-labels generated by the teacher network, while the exam network functions as a temporal ensemble of the student network to stabilize performance. The teacher network is formulated as a joint regression and classification model: the regression branch predicts height values that serve as pseudo-labels, and the classification branch predicts height value classes along with class probabilities, which are used to filter pseudo-labels. Height value classes are defined using a hierarchical bi-cut strategy to address the inherent long-tailed distribution of heights, and the predicted class probabilities are calibrated with a Plackett-Luce model to reflect the expected accuracy of pseudo-labels. We evaluate the proposed pipeline on three datasets spanning different resolutions and imaging modalities. Code is available at https://github.com/zhu-xlab/tse-net.
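The three mechanisms named in the abstract (temporal ensembling of the student into the exam network, confidence-based filtering of pseudo-labels, and a Plackett-Luce ranking model) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the `decay` and `threshold` values, the use of a fixed threshold for filtering, and the way the ranking is supplied to the Plackett-Luce likelihood are all assumptions made for illustration.

```python
import math


def ema_update(exam, student, decay=0.99):
    """Exponential moving average: exam <- decay * exam + (1 - decay) * student.

    `exam` and `student` are dicts mapping parameter names to floats
    (a stand-in for real network weights). `decay` is an assumed value.
    """
    return {name: decay * exam[name] + (1.0 - decay) * student[name]
            for name in exam}


def filter_pseudo_labels(heights, confidences, threshold=0.5):
    """Keep a teacher height prediction only if its class probability is high.

    Rejected pixels become None so they can be skipped in the student loss.
    A fixed threshold is an assumption; the source only states that class
    probabilities are used to filter pseudo-labels.
    """
    return [h if c >= threshold else None
            for h, c in zip(heights, confidences)]


def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of `ranking` under a Plackett-Luce model.

    `scores` are log-worths; `ranking` lists item indices from best to worst.
    P(ranking) = prod_i exp(s_i) / sum over items not yet ranked of exp(s_j).
    """
    nll = 0.0
    remaining = list(ranking)
    while remaining:
        # Log-sum-exp over the items still to be ranked, for stability.
        top = max(scores[i] for i in remaining)
        log_denominator = top + math.log(
            sum(math.exp(scores[i] - top) for i in remaining))
        nll -= scores[remaining[0]] - log_denominator
        remaining.pop(0)
    return nll


# Hypothetical usage: one EMA step and one filtering pass.
exam = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
kept = filter_pseudo_labels([3.2, 12.5, 0.4], [0.9, 0.3, 0.7])
```

In a calibration setting, one plausible use of `plackett_luce_nll` is to rank samples by ascending regression error and penalize confidence scores whose ordering disagrees with that ranking; whether the paper does exactly this is not stated in the text above.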

Paper Structure

This paper contains 27 sections, 16 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: TSE-Net Pipeline. The framework consists of three networks: the teacher (T), student (S), and exam (E) networks. The teacher network is a classification-regression multi-task network, where the class probabilities are calibrated to align with the regression errors via a Plackett-Luce model. The student and exam networks share the same regression architecture. On the labeled dataset, the teacher and student networks are jointly trained using the ground truth labels. On the unlabeled dataset, two views (a weak and a strong augmentation) are fed into the teacher and student networks, respectively. The teacher produces pseudo-labels that are filtered by class probabilities to supervise the student network. The exam network is updated as the exponential moving average (EMA) of the student network. During inference, only the exam network is used.
  • Figure 2: Different discretization strategies for converting regression to classification. $h_\text{max}$: maximal height value in the dataset. Uniform discretization (UD): divides the height range uniformly. Space-increasing discretization (SID): divides the height range logarithmically. Hierarchical Bi-cut (HBC): divides the height range hierarchically so that each cut splits the samples in half.
  • Figure 3: Height value distribution of the GBH training and validation sets. The distribution exhibits a pronounced long-tailed pattern: approximately $3 \times 10^8$ pixels (57% of the total) fall within background regions below 1 m, while the frequency of pixels at larger heights decreases sharply to about 10 pixels per 1 m bin. A similar long-tailed distribution is observed for the building ground-truth height values.
  • Figure 4: Qualitative results on ISPRS Vaihingen. sup.: supervised learning; semi: semi-supervised learning; GT: ground truth. The percentage at the bottom denotes the labeled ratio.
  • Figure 5: Qualitative results on SynRS3D. sup.: supervised learning; semi: semi-supervised learning; GT: ground truth. The percentage at the bottom denotes the labeled ratio.
  • ...and 1 more figures
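
The three discretization strategies in the Figure 2 caption can be compared with a short sketch that computes bin edges for each. It rests on stated assumptions rather than the paper's code: the function names are hypothetical, SID uses a shift of 1 to avoid taking the logarithm of zero, and HBC is implemented under one reading of the caption, in which the first cut is placed at the median and each later cut halves the samples remaining in the upper (tail) part. The paper's exact HBC rule may differ.

```python
import math


def uniform_edges(h_max, k):
    """UD: k bins of equal width over [0, h_max]."""
    return [h_max * i / k for i in range(k + 1)]


def sid_edges(h_max, k, shift=1.0):
    """SID: edges equally spaced in log space over [0, h_max].

    `shift` (an assumed choice) keeps log() defined at height 0.
    """
    log_lo = math.log(shift)
    log_hi = math.log(h_max + shift)
    return [math.exp(log_lo + (log_hi - log_lo) * i / k) - shift
            for i in range(k + 1)]


def hbc_edges(heights, k):
    """HBC, under one reading of the caption: each cut halves the tail.

    The first cut sits at the median of all samples; every further cut
    sits at the median of the samples above the previous cut. Heavily
    repeated values (e.g. background pixels) can produce duplicate edges,
    which a real implementation would need to handle.
    """
    ordered = sorted(heights)
    edges = [ordered[0]]
    remaining = ordered
    for _ in range(k - 1):
        if len(remaining) < 2:
            break  # nothing left to split
        mid = len(remaining) // 2
        edges.append(remaining[mid])
        remaining = remaining[mid:]
    edges.append(ordered[-1])
    return edges


# Hypothetical long-tailed sample: many low heights, few tall ones.
sample = [0.5] * 8 + [3.0, 4.0, 6.0, 9.0, 15.0, 30.0, 60.0, 120.0]
ud = uniform_edges(max(sample), 4)
sid = sid_edges(max(sample), 4)
hbc = hbc_edges(sample, 4)
```

On a long-tailed sample such as this one, uniform bins leave most samples in the first bin, whereas the data-driven cuts place edges where the samples actually are, which is the motivation the caption gives for HBC.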