3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation

Wenxin Chen; Mengxue Qu; Weitai Kang; Yan Yan; Yao Zhao; Yunchao Wei

TL;DR

The paper addresses 3D Referring Expression Segmentation (3D-RES) under limited annotations by introducing 3DResT, the first semi-supervised framework for this task. It builds on a teacher-student framework with burn-in and EMA updates, optimizing a supervised loss $\mathcal{L}_{\text{sup}}$ plus an unsupervised loss $\mathcal{L}_{\text{unsup}}$ weighted by $\lambda_u$, and introduces two mechanisms: Teacher-Student Consistency-Based Sampling (TSCS), which selectively augments the labeled set with high-quality pseudo-labels chosen via a correlation measure $correl$, and Quality-Driven Dynamic Weighting (QDW), which weights unlabeled samples by $\text{IoU}(Y_s^u, \hat{Y}_t^u)$. On ScanRefer, with only 1% labeled data, 3DResT improves mIoU by 8.34 points over the fully supervised baseline, and it also surpasses prior SSL approaches such as RefTeacher. The results indicate that using both high- and low-quality pseudo-labels can markedly reduce annotation costs while still delivering robust language-driven 3D segmentation, with implications for scalable 3D vision-language systems.
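The training objective and teacher update mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the flat list-of-floats parameter representation, the `decay` value, and the rule that only the supervised loss is used during burn-in are all assumptions, chosen to match the common teacher-student recipe.

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """Return new teacher parameters as an exponential moving average.

    Each teacher value moves a small step toward the matching student value.
    `decay` is a hypothetical hyperparameter; the summary does not state it.
    """
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


def total_loss(loss_sup, loss_unsup, lambda_u, step, burn_in_steps):
    """Combine the two losses as L_sup + lambda_u * L_unsup.

    Assumption: during burn-in (step < burn_in_steps) only the supervised
    loss is used, so the student is not trained on pseudo-labels yet.
    """
    if step < burn_in_steps:
        return loss_sup
    return loss_sup + lambda_u * loss_unsup
```

With `decay` close to 1 the teacher changes slowly, which is the usual reason for generating pseudo-labels from the teacher instead of the student.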

Abstract

3D Referring Expression Segmentation (3D-RES) typically requires extensive instance-level annotations, which are time-consuming and costly to obtain. Semi-supervised learning (SSL) mitigates this by using limited labeled data alongside abundant unlabeled data, improving performance while reducing annotation costs. SSL commonly uses a teacher-student paradigm in which the teacher generates pseudo-labels, filtered by a high confidence threshold, to guide the student. However, in 3D-RES, where each label corresponds to a single mask and labeled data is scarce, existing SSL methods treat high-quality pseudo-labels merely as auxiliary supervision, which limits the model's learning potential. Moreover, filtering with a high confidence threshold often discards potentially valuable pseudo-labels, restricting the model's ability to leverage the abundant unlabeled data. We therefore identify two critical challenges in semi-supervised 3D-RES: inefficient utilization of high-quality pseudo-labels, and wastage of useful information from low-quality pseudo-labels. In this paper, we introduce the first semi-supervised learning framework for 3D-RES, presenting a robust baseline method named 3DResT. To address these challenges, we propose two novel designs: Teacher-Student Consistency-Based Sampling (TSCS) and Quality-Driven Dynamic Weighting (QDW). TSCS selects high-quality pseudo-labels and integrates them into the labeled dataset to strengthen the labeled supervision signal. QDW retains low-quality pseudo-labels but dynamically assigns them lower weights, so that useful information is extracted from them rather than discarded. Extensive experiments on the widely used ScanRefer benchmark demonstrate the effectiveness of our method. Notably, with only 1% labeled data, 3DResT achieves an mIoU improvement of 8.34 points over the fully supervised method.
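As a rough illustration of the QDW idea, the sketch below scales each unlabeled sample's loss by the IoU between the student's and the teacher's predicted masks, so that low-agreement pseudo-labels contribute less but are not dropped. Using the raw IoU as the weight, representing masks as 0/1 lists, and the names `mask_iou` and `weighted_unsup_loss` are assumptions made for illustration; the paper's exact weighting function may differ.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary point masks (equal-length 0/1 sequences)."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Two empty masks have no union; treat their agreement as 0 here (an assumption).
    return inter / union if union else 0.0


def weighted_unsup_loss(student_masks, teacher_masks, per_sample_losses):
    """Average the per-sample losses, each scaled by its student-teacher IoU."""
    weights = [mask_iou(s, t) for s, t in zip(student_masks, teacher_masks)]
    weighted = [w * loss for w, loss in zip(weights, per_sample_losses)]
    return sum(weighted) / len(weighted) if weighted else 0.0
```

A hard confidence threshold would correspond to weights of exactly 0 or 1; a continuous weight keeps a graded contribution from every unlabeled sample.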
