Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks

Hao Chen, Jindong Wang, Ankit Shah, Ran Tao, Hongxin Wei, Xing Xie, Masashi Sugiyama, Bhiksha Raj

TL;DR

A light-weight black-box tuning method (NMTune) is proposed that affinely transforms the feature space to mitigate the malignant effect of noise and improve generalization on both ID and OOD tasks, given that one may not be able to fully fine-tune, or even access, the pre-trained models.

Abstract

Pre-training on large-scale datasets and then fine-tuning on downstream tasks has become a standard practice in deep learning. However, pre-training data often contain label noise that may adversely affect the generalization of the model. This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks. More specifically, through extensive experiments with supervised pre-training of models on synthetic noisy ImageNet-1K and YFCC15M datasets, we demonstrate that while slight noise in pre-training can benefit in-domain (ID) transfer performance, where the training and testing data share the same distribution, it always deteriorates out-of-domain (OOD) performance, where the training and testing data distributions differ. We empirically verify that the reason is that noise in pre-training shapes the feature space differently. We then propose a light-weight black-box tuning method (NMTune) that affinely transforms the feature space to mitigate the malignant effect of noise and improve generalization on both ID and OOD tasks, given that one may not be able to fully fine-tune, or even access, the pre-trained models. We conduct practical experiments on popular vision and language models that are pre-trained on noisy data to evaluate our approach. Our analysis and results show the importance of this interesting and novel research direction, which we term Noisy Model Learning.
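
The abstract refers to synthetic noisy pre-training data at various noise ratios. As a rough illustration only, the sketch below shows one common way to build such data: symmetric label noise, where a fixed fraction of labels is flipped uniformly to a different class. The function name `inject_symmetric_label_noise` and its parameters are hypothetical, and the paper's actual noise-injection procedure may differ.

```python
import numpy as np


def inject_symmetric_label_noise(labels, noise_ratio, num_classes, seed=0):
    """Flip a fraction `noise_ratio` of labels to a different random class.

    Illustrative assumption: symmetric (uniform) noise. This is not
    necessarily the procedure used in the paper.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()

    # Choose which samples get a corrupted label.
    n_flip = int(round(noise_ratio * labels.size))
    flip_idx = rng.choice(labels.size, size=n_flip, replace=False)

    # Adding an offset in [1, num_classes - 1] modulo num_classes
    # guarantees the new label differs from the original one.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    return noisy


if __name__ == "__main__":
    clean = np.arange(1000) % 10
    noisy = inject_symmetric_label_noise(clean, noise_ratio=0.05, num_classes=10)
    print("fraction of flipped labels:", (noisy != clean).mean())
```

Because every selected label is forced to change, the realized noise ratio matches the requested one up to rounding, which makes comparisons across ratios such as 5% and 10% straightforward.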

Paper Structure (30 sections, 6 equations, 17 figures, 16 tables)

Figures (17)

  • Figure 1: In-domain (ID) and out-of-domain (OOD) downstream performance when supervised pre-training the model on synthetic noisy ImageNet-1K (IN-1K) and YFCC15M of various noise ratios. We compare linear probing (LP) and the proposed method on 14 ID and 4 OOD tasks. On ID, $5\%$ noise in pre-training benefits the LP performance. Our method not only boosts the general performance but also rectifies the model pre-trained on clean data to be comparable to $5\%$ noise. On OOD, noise in pre-training is detrimental to robustness performance when conducting LP. Our method improves the transferability on OOD tasks significantly compared to LP.
  • Figure 2: Average ID and OOD evaluation results of ImageNet-1K (IN-1K) fully supervised pre-training ((a) and (b)) and YFCC15M CLIP pre-training ((c) and (d)) on downstream tasks with various percentages of data using ResNet-50. On ID evaluation, the transfer performance first increases as noise increases (to $5\%$ or $10\%$) and then decreases with more noise. On OOD evaluation, the robustness performance constantly decreases once noise is introduced in pre-training.
  • Figure 3: Feature SVD analysis. We compute the singular value entropy (SVE) for in-domain (ID) tasks and the largest singular value ratio (LSVR) for out-of-domain (OOD) tasks. Both metrics are computed for ImageNet-1K fully supervised pre-trained ((a) and (b)) and YFCC15M CLIP pre-trained ((c) and (d)) models. The SVE first slightly improves as the noise ratio increases to $5\%$ or $10\%$, indicating better generalization. As the noise ratio increases further, the SVE continues to improve and the LSVR drops significantly; this corresponds to worse generalization on ID and OOD tasks, because more noise structure is learned and the dominant singular components become less transferable.
  • Figure 4: Illustration of noisy label learning (left) and the proposed Noisy Model Learning (right). Noisy label learning mainly focuses on robustly training a model from scratch or fine-tuning a model from pre-training on a noisy dataset. Noisy model learning focuses on robustly adapting the black-box noisy pre-trained models to downstream datasets with no assumption on the downstream dataset.
  • Figure 5: Evaluation of our method on ID and OOD downstream tasks, compared to MLP tuning and LP on ResNet-50 models pre-trained on ImageNet-1K (IN-1K) and YFCC15M. (a) Average F1 score on ID tasks; (b) SVE on ID tasks; (c) Average F1 score on OOD tasks; (d) LSVR on OOD tasks. Our method yields better SVE and LSVR on both ID and OOD tasks, along with better generalization performance. Our method also rectifies the malignant noise effect: on ID tasks, the feature extractor pre-trained on clean data now outperforms those pre-trained on noisy data, and on OOD tasks, the performance gap between the clean model and the one with $5\%$ noise becomes smaller.
  • ...and 12 more figures

Theorems & Definitions (2)

  • Definition 2.1: Singular Value Entropy
  • Definition 2.2: Largest Singular Value Ratio
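
Only the names of these two definitions are listed here, so the following is a minimal sketch under stated assumptions rather than the paper's exact formulas. It assumes SVE is the Shannon entropy of the singular values normalized to sum to one, and LSVR is the negative logarithm of the largest singular value divided by the sum of all singular values. The function names and the `features` matrix (one row per sample, one column per feature dimension) are hypothetical.

```python
import numpy as np


def _singular_values(features):
    # Singular values of the (num_samples x feature_dim) matrix,
    # returned in descending order.
    return np.linalg.svd(np.asarray(features, dtype=float), compute_uv=False)


def singular_value_entropy(features, eps=1e-12):
    """Assumed SVE: entropy of the normalized singular value distribution."""
    s = _singular_values(features)
    p = s / s.sum()
    # eps avoids log(0) when some singular values are numerically zero.
    return float(-(p * np.log(p + eps)).sum())


def largest_singular_value_ratio(features):
    """Assumed LSVR: -log(sigma_1 / sum_i sigma_i)."""
    s = _singular_values(features)
    return float(-np.log(s[0] / s.sum()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(256, 64))  # stand-in for extracted features
    print("SVE :", singular_value_entropy(feats))
    print("LSVR:", largest_singular_value_ratio(feats))
```

Under these assumptions, a flat singular value spectrum gives a high SVE, while a spectrum dominated by a single component gives values of both metrics close to zero.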