AdaCo: Overcoming Visual Foundation Model Noise in 3D Semantic Segmentation via Adaptive Label Correction
Pufan Zou, Shijia Zhao, Weijie Huang, Qiming Xia, Chenglu Wen, Wei Li, Cheng Wang
TL;DR
AdaCo tackles noisy supervision in outdoor 3D semantic segmentation by combining cross-modal information from Visual Foundation Models (VFMs) with adaptive label correction and robust training. It introduces three components: the Cross-modal Label Generation Module (CLGM), which lifts 2D VFM-generated descriptions into 3D space; the Adaptive Noise Corrector (ANC), which iteratively refurbishes noisy labels using early-learning signals and clustering; and the Adaptive Robust Loss (ARL), which regulates learning by blending a warmup loss with a corrected loss. The corrected loss is $L_{corr}=\lambda L_{NCE}+\beta L_{MAE}$, and $L_{warmup}$ includes $L_{CE}$, $L_{MSE}$, and the Lovász loss; after $t_c$, a negative weight on $L_{CE}$ (e.g., $\sigma=-0.99$) attenuates overfitting. Empirical results on SemanticKITTI and nuScenes show substantial improvements over existing label-free methods, and ablations confirm that CLGM, ANC, and ARL each add value. The work demonstrates a scalable approach to open-world outdoor perception that converts 2D semantic understanding into robust 3D supervision without manual annotations. The stated limitations, which also point to future work, are the dependence on sensor calibration and the limited coverage of VFMs.
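The corrected loss $L_{corr}=\lambda L_{NCE}+\beta L_{MAE}$ can be illustrated for a single point with a minimal sketch. It assumes that $L_{NCE}$ denotes normalized cross-entropy (the cross-entropy of the given label divided by the sum of cross-entropies over all classes) and that $L_{MAE}$ is the L1 distance between the predicted distribution and the one-hot label; neither definition is spelled out in this summary. The function names and the default weights `lam` and `beta` are hypothetical, and the warmup loss and the $t_c$ schedule are not covered.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one point's class logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nce_loss(probs, label, eps=1e-12):
    # Assumed definition of L_NCE: cross-entropy of the given label,
    # normalized by the sum of cross-entropies over all classes.
    neg_logs = [-math.log(max(p, eps)) for p in probs]
    return neg_logs[label] / sum(neg_logs)

def mae_loss(probs, label):
    # Assumed definition of L_MAE: L1 distance to the one-hot label,
    # which equals 2 * (1 - probs[label]) and is bounded by 2.
    return sum(abs(p - (1.0 if k == label else 0.0))
               for k, p in enumerate(probs))

def corrected_loss(logits, label, lam=1.0, beta=1.0):
    # L_corr = lam * L_NCE + beta * L_MAE (weights are placeholders).
    probs = softmax(logits)
    return lam * nce_loss(probs, label) + beta * mae_loss(probs, label)
```

Both terms are bounded under these definitions, which is the property usually cited for their tolerance to label noise: a single mislabeled point cannot contribute an arbitrarily large loss, unlike plain cross-entropy.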
Abstract
Recently, Visual Foundation Models (VFMs) have shown remarkable generalization performance in 3D perception tasks. However, their effectiveness on large-scale outdoor datasets remains constrained by the scarcity of accurate supervision signals, the extensive noise caused by variable outdoor conditions, and the abundance of unknown objects. In this work, we propose a novel label-free learning method, Adaptive Label Correction (AdaCo), for 3D semantic segmentation. AdaCo first introduces the Cross-modal Label Generation Module (CLGM), which provides cross-modal supervision by drawing on the strong interpretive capabilities of VFMs. AdaCo then incorporates the Adaptive Noise Corrector (ANC), which iteratively updates and adjusts the noisy samples within this supervision during training. Moreover, we develop an Adaptive Robust Loss (ARL) function that modulates each sample's sensitivity to noisy supervision, preventing the underfitting issues associated with robust losses. AdaCo effectively mitigates the performance limitations of label-free learning networks in 3D semantic segmentation tasks. Extensive experiments on two outdoor benchmark datasets highlight the superior performance of our method.
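To make the idea of cross-modal supervision concrete, the sketch below shows one generic way 2D per-pixel labels can be transferred to 3D points through a camera projection. It is an illustration under stated assumptions, not CLGM itself: the function `lift_labels`, the 3x4 projection matrix `proj` (assumed to combine LiDAR-to-camera extrinsics with camera intrinsics), and the `ignore` value are all hypothetical.

```python
import math

def lift_labels(points, proj, label_map, ignore=-1):
    # points: iterable of (x, y, z) coordinates in the LiDAR frame.
    # proj: hypothetical 3x4 matrix mapping homogeneous LiDAR points
    #       to homogeneous pixel coordinates.
    # label_map: 2D list of per-pixel class ids, indexed [row][col].
    height, width = len(label_map), len(label_map[0])
    labels = []
    for x, y, z in points:
        u_h, v_h, depth = (
            row[0] * x + row[1] * y + row[2] * z + row[3] for row in proj
        )
        if depth <= 0:
            # Point is behind the camera, so it gets no 2D label.
            labels.append(ignore)
            continue
        u = math.floor(u_h / depth)  # pixel column
        v = math.floor(v_h / depth)  # pixel row
        if 0 <= u < width and 0 <= v < height:
            labels.append(label_map[v][u])
        else:
            # Point projects outside the image bounds.
            labels.append(ignore)
    return labels
```

Any error in `proj` shifts where each point lands in the image, which is one reason projected labels of this kind are noisy and depend on calibration quality.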
