Learning with Noisy Ground Truth: From 2D Classification to 3D Reconstruction

Yangdi Lu, Wenbo He

TL;DR

The work addresses learning with noisy ground truth (LNGT) by formalizing LNGT, deriving an error-decomposition framework, and analyzing memorization in both 2D classification and 3D reconstruction (NeRF/3DGS). It proposes a taxonomy of solutions targeting estimation and fitting errors, including data augmentation, regularization, robust losses, sample selection, and loss correction, to achieve noise-robust learning. The paper highlights memorization as a core challenge and connects 2D label-noise robustness to 3D reconstruction under noisy imagery, suggesting practical pathways such as dynamic masking (e.g., GMM-based) and loss-based guidance. Overall, it offers a structured lens to study LNGT and provides a roadmap for robust learning across vision tasks, with potential extensions to challenging 3D scene synthesis under real-world noise.

Abstract

Deep neural networks have been highly successful in data-intensive computer vision applications, but this success relies heavily on massive amounts of clean data. In real-world scenarios, clean data is sometimes difficult to obtain. For example, in image classification and segmentation tasks, precise annotation of millions of samples is generally very expensive and time-consuming. In the 3D static scene reconstruction task, most NeRF-related methods rely on the foundational assumption of a static scene (e.g., consistent lighting conditions and persistent object positions), which is often violated in real-world scenarios. To address these problems, learning with noisy ground truth (LNGT) has emerged as an effective learning method and shows great potential. In this short survey, we propose a formal definition to unify the analysis of LNGT in the context of different machine learning tasks (classification and regression). Based on this definition, we propose a novel taxonomy that classifies existing work according to the error decomposition under the fundamental definition of machine learning. Further, we provide an in-depth analysis of the memorization effect and an insightful discussion of potential future research opportunities from 2D classification to 3D reconstruction, in the hope of providing guidance for follow-up research.

Paper Structure (16 sections, 11 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Effects of Noisy Labels. Softmax outputs on the noisy label and the latent true label when training an 8-layer CNN on CIFAR-10 with 40% label noise. The x-axis is epochs and the y-axis is the output probability on the assigned label. We compare vanilla training using cross-entropy (CE) loss with a method that adds entropy minimization (EM) of predictions to CE. The output probability of CE+EM is more stable than that of CE.
  • Figure 2: Memorization Effect in Image Classification. We train ResNet34 on CIFAR-10 with 60% label noise using CE loss and investigate the loss distribution. Top row: The normalized loss distribution at different training epochs. Bottom row: The corresponding mixture model after fitting a two-component GMM to the loss distribution. The two components gradually separate at the beginning and start to merge as training continues.
  • Figure 3: Memorization Effect in 3D Reconstruction. The purity (clean scene) and distractor pixels correspond to the blue and red bars in the histograms, respectively. We observe that during the initial stages of optimization (e.g., the 5000th step), the image is blurry, yet the scene remains relatively clean. As optimization progresses, the image sharpens and distractors emerge. Further analysis of the accumulated loss distribution shows that, at the early stage, Mip-NeRF 360 primarily focuses on learning purity pixels (of the clean scene) and leaves most distractor pixels out of the learning process, as evidenced by the minimal changes in the histogram of distractor pixels (red bars).
  • Figure 4: Qualitative Comparison on the RobustNeRF Dataset. Compared to the baselines, Mip-NeRF 360 and 3DGS, our Mask-NeRF and Mask-3DGS not only efficiently eliminate distractors but also retain a higher level of detail. In comparison with RobustNeRF*, MemE models demonstrate superior performance in removing distractors, leading to a reduction in artifacts and an enhancement in detail preservation.
  • Figure 5: Comparison of learning with clean and noisy ground truth.
  • ...and 1 more figures
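
The Figure 2 caption describes fitting a two-component GMM to the normalized per-sample loss distribution. The sketch below illustrates that idea only and is not the authors' implementation. It assumes a plain-Python EM fit, and the function name `clean_posteriors`, the initial parameters, the 0.5 threshold, and the synthetic losses are all hypothetical choices made for this example.

```python
import math
import random


def _gauss_pdf(v, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((v - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def clean_posteriors(losses, n_iter=50, var_floor=1e-4):
    """Fit a two-component 1-D GMM to per-sample losses with EM.

    Returns, for each sample, the posterior probability of belonging to the
    low-mean component, read here as "likely clean" (an assumption).
    """
    # Min-max normalize the losses to [0, 1].
    lo, hi = min(losses), max(losses)
    span = (hi - lo) or 1.0
    x = [(v - lo) / span for v in losses]
    n = len(x)

    # Hypothetical initialization: one low-loss and one high-loss component.
    means, variances, weights = [0.25, 0.75], [0.05, 0.05], [0.5, 0.5]
    resp = [0.5] * n
    for _ in range(n_iter):
        # E-step: responsibility of component 0 for every sample.
        resp = []
        for v in x:
            p0 = weights[0] * _gauss_pdf(v, means[0], variances[0])
            p1 = weights[1] * _gauss_pdf(v, means[1], variances[1])
            resp.append(p0 / (p0 + p1 + 1e-300))
        # M-step: re-estimate mean, variance, and weight of each component.
        for k in (0, 1):
            r = resp if k == 0 else [1.0 - p for p in resp]
            nk = max(sum(r), 1e-12)
            means[k] = sum(ri * v for ri, v in zip(r, x)) / nk
            var_k = sum(ri * (v - means[k]) ** 2 for ri, v in zip(r, x)) / nk
            variances[k] = max(var_k, var_floor)
            weights[k] = nk / n
    # Make sure the returned posterior refers to the low-mean component.
    if means[0] > means[1]:
        resp = [1.0 - p for p in resp]
    return resp


# Synthetic losses: 600 low-loss ("clean") and 400 high-loss ("noisy") samples.
rng = random.Random(0)
clean = [abs(rng.gauss(0.3, 0.1)) for _ in range(600)]
noisy = [abs(rng.gauss(2.0, 0.3)) for _ in range(400)]
p_clean = clean_posteriors(clean + noisy)
mask = [p > 0.5 for p in p_clean]  # True = keep as likely clean
print(sum(mask), "samples kept as likely clean")
```

The split is only meaningful while the two components are well separated. According to the caption, they start to merge as training continues, so a mask of this kind would have to be computed early in training.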

Theorems & Definitions (2)

  • Definition 3.1: Machine Learning (Mitchell, 1997)
  • Definition 3.2