3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation

Wenxin Chen, Mengxue Qu, Weitai Kang, Yan Yan, Yao Zhao, Yunchao Wei

TL;DR

The paper addresses 3D Referring Expression Segmentation (3D-RES) under limited annotations by introducing 3DResT, the first semi-supervised framework for this task. It combines a teacher-student framework with a burn-in stage and exponential moving average (EMA) updates, optimizing a supervised loss $\mathcal{L}_{\text{sup}}$ plus an unsupervised loss $\mathcal{L}_{\text{unsup}}$ weighted by $\lambda_u$. It introduces two mechanisms: Teacher-Student Consistency-Based Sampling (TSCS), which augments the labeled set with high-quality pseudo-labels selected via a correlation measure $\mathit{correl}$, and Quality-Driven Dynamic Weighting (QDW), which weights unlabeled samples by $\text{IoU}(Y_s^u, \hat{Y}_t^u)$. On ScanRefer, with only 1% labeled data, 3DResT gains 8.34 mIoU points over the fully supervised baseline and surpasses prior SSL approaches such as RefTeacher. The results indicate that using both high- and low-quality pseudo-labels can reduce annotation costs while keeping 3D language-driven segmentation robust, with implications for scalable 3D vision-language systems.
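As a rough illustration of the objective described above, the sketch below combines a supervised loss with an IoU-weighted unsupervised loss. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `mask_iou` and `total_loss` are hypothetical, masks are plain 0/1 sequences, the QDW weight is assumed to be the raw IoU between the student prediction and the teacher pseudo-label (the paper may apply a different function of that IoU), and per-sample unsupervised losses are assumed to be averaged.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equal-length sequences of 0/1."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 0.0


def total_loss(sup_loss, unsup_losses, student_masks, teacher_masks, lambda_u):
    """L = L_sup + lambda_u * mean_i(w_i * L_unsup_i).

    Assumption: w_i is the raw IoU between the student's prediction and the
    teacher's pseudo-label for unlabeled sample i, so low-agreement samples
    are down-weighted rather than discarded.
    """
    if not unsup_losses:
        return sup_loss
    weights = [mask_iou(s, t) for s, t in zip(student_masks, teacher_masks)]
    weighted = sum(w * l for w, l in zip(weights, unsup_losses)) / len(unsup_losses)
    return sup_loss + lambda_u * weighted


# Usage: two unlabeled samples, one with partial and one with full agreement.
student = [[1, 1, 0, 0], [1, 1]]
teacher = [[1, 0, 0, 0], [1, 1]]
loss = total_loss(1.0, [2.0, 4.0], student, teacher, lambda_u=0.5)
```

Using a continuous weight instead of a hard confidence threshold means that a pseudo-label with low teacher-student agreement still contributes a small gradient signal, which is the behavior the QDW description points to.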

Abstract

3D Referring Expression Segmentation (3D-RES) typically requires extensive instance-level annotations, which are time-consuming and costly. Semi-supervised learning (SSL) mitigates this by using limited labeled data alongside abundant unlabeled data, improving performance while reducing annotation costs. SSL typically uses a teacher-student paradigm in which the teacher generates pseudo-labels, filtered by a high-confidence threshold, to guide the student. However, in the context of 3D-RES, where each label corresponds to a single mask and labeled data is scarce, existing SSL methods treat high-quality pseudo-labels merely as auxiliary supervision, which limits the model's learning potential. The reliance on high-confidence thresholds for filtering often results in potentially valuable pseudo-labels being discarded, restricting the model's ability to leverage the abundant unlabeled data. Therefore, we identify two critical challenges in semi-supervised 3D-RES, namely, inefficient utilization of high-quality pseudo-labels and wastage of useful information from low-quality pseudo-labels. In this paper, we introduce the first semi-supervised learning framework for 3D-RES, presenting a robust baseline method named 3DResT. To address these challenges, we propose two novel designs called Teacher-Student Consistency-Based Sampling (TSCS) and Quality-Driven Dynamic Weighting (QDW). TSCS selects high-quality pseudo-labels and integrates them into the labeled dataset to strengthen the labeled supervision signals. QDW preserves low-quality pseudo-labels by dynamically assigning them lower weights, so that useful information is extracted from them rather than discarded. Extensive experiments conducted on a widely used benchmark demonstrate the effectiveness of our method. Notably, with only 1% labeled data, 3DResT achieves an mIoU improvement of 8.34 points compared to the fully supervised method.

Paper Structure

This paper contains 17 sections, 10 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Our proposed model efficiently leverages unlabeled data and performs favorably against both the fully supervised method and the existing semi-supervised method from 2D referring work. The supervised method uses only labeled supervision, while RefTeacher and 3DResT leverage both labeled supervision and unlabeled data.
  • Figure 2: The overall semi-supervised 3D-RES framework, 3DResT, consists of two 3D-RES networks with identical configurations, referred to as the Teacher and the Student. The Teacher predicts pseudo-labels for unlabeled data, which are used to train the Student alongside a small number of labeled samples. The Teacher is updated from the Student via EMA [46]. Additionally, Teacher-Student Consistency-Based Sampling (TSCS) and Quality-Driven Dynamic Weighting (QDW) are employed to address the challenges of inefficient utilization of high-quality pseudo-labels and wastage of useful information from low-quality pseudo-labels. The correl metric in TSCS, described in Section III-C, helps select high-quality pseudo-labels.
  • Figure 3: Visualizations of 3DResT and fully supervised baselines. Subfigures (a) and (b) indicate that the TSCS and QDW components of 3DResT visibly improve the quality of predictions under semi-supervised settings.
  • Figure 4: Comparison of the supervised method and 3DResT. The blue mask is the ground truth, the red mask is the prediction of the supervised method, and the green mask is the prediction of 3DResT.
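
The EMA update of the Teacher mentioned in the Figure 2 caption can be sketched as follows. This is an illustrative sketch only: the function name `ema_update`, the dictionary-of-floats parameter representation, and the decay value are assumptions, not details taken from the paper.

```python
def ema_update(teacher, student, decay=0.999):
    """Return updated Teacher parameters as an exponential moving average.

    Each parameter becomes decay * teacher + (1 - decay) * student.
    Parameters are plain floats keyed by name (an assumption for brevity;
    a real model would hold tensors).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Usage: after the burn-in stage the Teacher starts as a copy of the Student,
# then tracks it slowly as the Student keeps training.
student_params = {"w": 0.0, "b": 2.0}
teacher_params = dict(student_params)      # copy of the Student after burn-in
student_params = {"w": 1.0, "b": 4.0}      # Student after further training (made-up values)
teacher_params = ema_update(teacher_params, student_params, decay=0.5)
```

A decay close to 1 makes the Teacher change slowly, which tends to give more stable pseudo-labels than copying the Student at every step.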