AdaCo: Overcoming Visual Foundation Model Noise in 3D Semantic Segmentation via Adaptive Label Correction

Pufan Zou, Shijia Zhao, Weijie Huang, Qiming Xia, Chenglu Wen, Wei Li, Cheng Wang

TL;DR

AdaCo addresses noisy supervision in outdoor 3D semantic segmentation by combining cross-modal pseudo labels from Visual Foundation Models (VFMs) with adaptive label correction and robust training. It introduces three components: CLGM lifts 2D VFM-generated descriptions into 3D space; ANC iteratively refurbishes noisy labels using early-learning signals and clustering; and ARL regulates learning by blending a warmup loss with a corrected loss. The corrected loss is $L_{corr}=\lambda L_{NCE}+\beta L_{MAE}$, and $L_{warmup}$ includes $L_{CE}$, $L_{MSE}$, and the Lovász loss; after $t_c$, a negative weight on $L_{CE}$ (e.g., $\sigma=-0.99$) attenuates overfitting. Empirical results on SemanticKITTI and nuScenes show substantial improvements over existing label-free methods, and ablations confirm that CLGM, ANC, and ARL each contribute. The work demonstrates a scalable approach to open-world outdoor perception that converts 2D semantic understanding into robust 3D supervision without manual annotations; it notes calibration dependence and VFM coverage as limitations and directions for future work.
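
To make the corrected loss $L_{corr}=\lambda L_{NCE}+\beta L_{MAE}$ concrete, here is a minimal NumPy sketch. It assumes the commonly used definitions of normalized cross entropy (the cross entropy of the labeled class divided by the sum of cross entropies over all classes) and of MAE on softmax probabilities. The paper's exact formulation may differ, and the function names and the default values of `lam` and `beta` are placeholders, not values from the paper.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nce_loss(probs, labels, eps=1e-12):
    # Assumed definition: -log p_y / (-sum_k log p_k), averaged over points.
    log_p = np.log(probs + eps)
    numerator = -log_p[np.arange(len(labels)), labels]
    denominator = -log_p.sum(axis=1)
    return float(np.mean(numerator / denominator))

def mae_loss(probs, labels):
    # Assumed definition: L1 distance between probabilities and one-hot labels.
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.abs(probs - onehot).sum(axis=1)))

def corrected_loss(logits, labels, lam=1.0, beta=1.0):
    # L_corr = lam * L_NCE + beta * L_MAE; lam and beta are placeholder weights.
    probs = softmax(logits)
    return lam * nce_loss(probs, labels) + beta * mae_loss(probs, labels)

# Usage: two points, four classes, uninformative (all-zero) logits.
logits = np.zeros((2, 4))
labels = np.array([0, 3])
print(corrected_loss(logits, labels))
```

Under these assumed definitions both terms are bounded (NCE lies in [0, 1] and MAE in [0, 2]), which is the usual reason such losses are considered less sensitive to mislabeled samples than plain cross entropy.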

Abstract

Recently, Visual Foundation Models (VFMs) have shown remarkable generalization performance in 3D perception tasks. However, their effectiveness on large-scale outdoor datasets remains constrained by the scarcity of accurate supervision signals, the extensive noise caused by variable outdoor conditions, and the abundance of unknown objects. In this work, we propose a novel label-free learning method, Adaptive Label Correction (AdaCo), for 3D semantic segmentation. AdaCo first introduces the Cross-modal Label Generation Module (CLGM), which provides cross-modal supervision by drawing on the strong interpretive capabilities of VFMs. AdaCo then incorporates the Adaptive Noise Corrector (ANC), which iteratively updates and adjusts the noisy samples within this supervision during training. Moreover, we develop an Adaptive Robust Loss (ARL) function to modulate each sample's sensitivity to noisy supervision, preventing the potential underfitting associated with robust losses. AdaCo effectively mitigates the performance limitations of label-free learning networks in 3D semantic segmentation tasks. Extensive experiments on two outdoor benchmark datasets highlight the superior performance of our method.

Paper Structure

This paper contains 27 sections, 10 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: In outdoor scenes, variable environmental conditions and a rich variety of object categories introduce significant noise when VFMs are applied to 3D perception tasks. AdaCo effectively improves the label quality of the baseline method, MaskCLIP, for the major categories in road scenes.
  • Figure 2: Illustration of our label-free 3D semantic segmentation method, AdaCo. (a) The noisy 3D pseudo labels are obtained by transferring the VFM-generated pixel-wise annotations to each point with CLGM. (b) These noisy pseudo labels are then adaptively refurbished by the proposed ANC. (c) ARL adaptively adjusts the loss based on the training IoU curve and enforces different penalties for noisy and clean labels.
  • Figure 3: CLGM pipeline. We use the combination of SAM and SSA-Engine as our 2D-PLGE to produce segmentation masks with semantic descriptions. We then calculate the semantic similarity between the text prompt and each description word by word, and assign to each pixel the class with the highest semantic similarity as its label. Finally, the pixel-wise labels are refined by inter-frame voxel-voting and transferred to the points.
  • Figure 4: Learning curves of different mIoU measures on the SemanticKITTI training set. The training mIoU is calculated against the noisy ground truth, while the early-learning IoU curve is calculated against the correct ground truth.
  • Figure 5: Qualitative results on SemanticKITTI (a)-(d) and nuScenes (e)-(h). The noisy ground truth is generated by CLGM. As the red circles show, AdaCo remembers clean samples and mitigates the blurring of object edges.
  • ...and 3 more figures
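
The Figure 3 caption mentions refining labels by inter-frame voxel-voting without giving details. One plausible reading is a per-voxel majority vote over points accumulated from several frames. The sketch below illustrates that idea only: the function name `voxel_vote`, the `voxel_size` value, the `ignore_label` convention, and the assumption that all points are already expressed in a common coordinate frame are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict

import numpy as np

def voxel_vote(points, labels, voxel_size=0.2, ignore_label=-1):
    """Give every point the majority label of the voxel it falls in.

    points: (N, 3) array of points from several frames, assumed to be
            already aligned in one coordinate frame.
    labels: (N,) integer pseudo labels; ignore_label marks unlabeled points.
    """
    # Integer voxel index of each point.
    keys = [tuple(k) for k in np.floor(points / voxel_size).astype(np.int64)]

    # Collect label votes per voxel, skipping unlabeled points.
    votes = defaultdict(Counter)
    for key, label in zip(keys, labels):
        if label != ignore_label:
            votes[key][int(label)] += 1

    # Assign the winning label; voxels without any vote stay unlabeled.
    refined = np.full(len(labels), ignore_label, dtype=np.int64)
    for i, key in enumerate(keys):
        if votes[key]:
            refined[i] = votes[key].most_common(1)[0][0]
    return refined

# Usage: three points share a voxel, one point sits in another voxel.
points = np.array([[0.01, 0.01, 0.01],
                   [0.05, 0.05, 0.05],
                   [0.10, 0.10, 0.10],
                   [1.00, 1.00, 1.00]])
labels = np.array([2, 2, 5, 7])
print(voxel_vote(points, labels))
```

In this toy input the outlier label 5 is outvoted by its two neighbors, which is the kind of local inconsistency a voting step is meant to smooth out.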