Impact of Noisy Supervision in Foundation Model Learning

Hao Chen; Zihan Wang; Ran Tao; Hongxin Wei; Xing Xie; Masashi Sugiyama; Bhiksha Raj; Jindong Wang

TL;DR

This work investigates how label noise in large pre-training datasets affects downstream task transfer, especially under black-box or parameter-efficient fine-tuning (PEFT). It analyzes the learned feature space via singular value metrics and demonstrates that small amounts of pre-training noise can improve in-domain (ID) performance but consistently harm out-of-domain (OOD) generalization. To mitigate these harmful effects, the authors propose Noisy Model Tuning (NMTune), a lightweight regularization-based method that reshapes the downstream feature space and is compatible with both black-box and PEFT approaches. The results span vision and language models, including API-based systems, and show that NMTune enhances robustness and transferability across ID and OOD tasks, marking a step toward robust Noisy Model Learning for foundation models.

Abstract

Foundation models are usually pre-trained on large-scale datasets and then adapted to downstream tasks through tuning. However, the large-scale pre-training datasets, often inaccessible or too expensive to handle, can contain label noise that may adversely affect the generalization of the model and pose unexpected risks. This paper stands out as the first work to comprehensively understand and analyze the nature of noise in pre-training datasets and then effectively mitigate its impact on downstream tasks. Specifically, through extensive experiments with fully-supervised and image-text contrastive pre-training on synthetic noisy ImageNet-1K, YFCC15M, and CC12M datasets, we demonstrate that slight noise in pre-training can benefit in-domain (ID) performance, where the training and testing data share a similar distribution, but always deteriorates out-of-domain (OOD) performance, where the training and testing distributions differ significantly. These observations hold regardless of the scale of the pre-training dataset, the pre-training noise type, the model architecture, the pre-training objective, the downstream tuning method, and the downstream application. We empirically ascertain that the reason is that pre-training noise shapes the feature space differently. We then propose a tuning method (NMTune) that reshapes the feature space to mitigate the malignant effect of noise and improve generalization, and that is applicable to both parameter-efficient and black-box tuning. We additionally conduct extensive experiments on popular vision and language models, including APIs, that were pre-trained with supervised and self-supervised objectives on realistic noisy data. Our analysis and results demonstrate the importance of this novel and fundamental research direction, which we term Noisy Model Learning.
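The synthetic noisy pre-training sets mentioned above can be illustrated with a minimal sketch. It assumes uniform (symmetric) label flipping, where a chosen fraction of labels is replaced by a different class drawn uniformly at random. The noise-generation protocol is not described in this summary, so the function name `inject_symmetric_label_noise`, its parameters, and the flipping rule are illustrative assumptions rather than the authors' procedure.

```python
import random


def inject_symmetric_label_noise(labels, num_classes, noise_ratio, seed=0):
    """Return a copy of `labels` with a `noise_ratio` fraction flipped.

    Hypothetical helper: each selected label is replaced by a *different*
    class chosen uniformly at random (symmetric noise). Labels are assumed
    to be integers in [0, num_classes).
    """
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be in [0, 1]")
    if num_classes < 2:
        raise ValueError("need at least two classes to flip a label")

    rng = random.Random(seed)  # local RNG keeps the corruption reproducible
    noisy = list(labels)
    num_flipped = round(noise_ratio * len(noisy))

    # Pick the positions to corrupt without replacement.
    for idx in rng.sample(range(len(noisy)), num_flipped):
        # A non-zero offset modulo num_classes guarantees the label changes
        # and is uniform over the remaining num_classes - 1 classes.
        offset = rng.randrange(1, num_classes)
        noisy[idx] = (noisy[idx] + offset) % num_classes
    return noisy
```

Calling `inject_symmetric_label_noise(labels, num_classes=1000, noise_ratio=0.05)` would correspond to the 5% noise setting discussed in the abstract, under the assumptions stated above.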

Paper Structure (26 sections, 7 equations, 10 figures, 6 tables)

This paper contains 26 sections, 7 equations, 10 figures, 6 tables.

Introduction
Related Work
Noisy Label Learning
Pre-training and Fine-Tuning
Pre-training Data Biases
Noisy Model Transfer Learning
Understanding the Pre-training Noise
Experiments Design
Pre-training Results
Results on Downstream Classification Tasks
Downstream results using linear probing
Downstream results using LoRA
Downstream results using full fine-tuning
Results on Detection and Segmentation Tasks
...and 11 more sections

Figures (10)

Figure 1: Illustration of noisy label learning (left) and the proposed noisy model transfer learning (right). Noisy label learning mainly focuses on robustly training a model from scratch, or fine-tuning a pre-trained model, on a noisy downstream dataset. Noisy model transfer learning focuses on robustly adapting (partially) black-box noisy pre-trained models to various downstream tasks, for which we make no additional assumptions.
Figure 2: In-domain (ID) and out-of-domain (OOD) downstream performance of ResNet-50 and ViT-B-16 at various noise ratios, under fully-supervised (FS) pre-training on synthetic noisy ImageNet-1K (IN-1K) and image-text contrastive pre-training on YFCC15M (and CC12M). We compare linear probing (LP) and parameter-efficient tuning with LoRA against the proposed method. On ID, $5\%$ noise in pre-training benefits LP performance. Our method not only boosts overall performance but also rectifies the model pre-trained on clean data to be comparable to the one pre-trained with $5\%$ noise. On OOD, noise in pre-training is detrimental to robustness when conducting LP and LoRA. Our method significantly improves transferability on OOD tasks.
Figure 3: Average ID and OOD evaluation results of ResNet-50 (top row) and ViT-B-16 (bottom row), using ImageNet-1K (IN-1K) fully-supervised pre-training ((a), (b), (e), (f)) and YFCC15M (and CC12M) CLIP pre-training ((c), (d), (g), (h)) on downstream tasks with various percentages of data. For both ResNet-50 and ViT-B-16 pre-trained on datasets of different scales, on ID evaluation, the transfer performance first increases as noise increases (to $5\%$ or $10\%$) and then decreases with more noise. On OOD evaluation, the robustness performance consistently decreases.
Figure 4: Average ID and OOD tuning results of ResNet-50 (top row) and ViT-B-16 (bottom row), using ImageNet-1K (IN-1K) fully-supervised pre-training ((a), (b), (e), (f)) and YFCC15M (and CC12M) CLIP pre-training ((c), (d), (g), (h)) on downstream tasks with full data. For ResNet-50, we adopt linear probing (LP) and full fine-tuning (FT). For ViT-B-16, we additionally adopt LoRA. Across tuning methods, we make similar observations on downstream tasks: slight noise in pre-training benefits the model's ID performance but always hurts the OOD performance. As more pre-trained parameters are modified on downstream tasks, i.e., from LP (to LoRA) to FT, the difference (shown on the top of each bar) between noisy pre-trained models becomes smaller in terms of both ID benefits (with slight noise) and OOD deterioration.
Figure 5: Feature SVD analysis of ResNet-50 (top row) and ViT-B-16 (bottom row). We compute the singular value entropy (SVE) for in-domain (ID) tasks and the largest singular value ratio (LSVR) for out-of-domain (OOD) tasks. Both metrics are computed for ImageNet-1K fully-supervised pre-trained ((a), (b), (e), (f)) and YFCC15M (and CC12M) CLIP pre-trained ((c), (d), (g), (h)) models. For both model architectures, the SVE first slightly improves as the noise ratio increases to $5\%$ or $10\%$, indicating better generalization. As the noise ratio increases further, the SVE continues to improve and the LSVR drops significantly, corresponding to worse generalization on OOD tasks: the dominant singular components become less transferable.
...and 5 more figures

Theorems & Definitions (3)

Definition 4.1
Definition 6.1: Singular Value Entropy
Definition 6.2: Largest Singular Value Ratio
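
The two definitions listed above are named but not reproduced in this summary. As a rough sketch only, the following assumes a common formulation: SVE is the Shannon entropy of the singular values normalized to sum to one, and LSVR is the negative logarithm of the largest singular value's share of that sum. Both formulas, and the function names, are assumptions; the paper's Definitions 6.1 and 6.2 may differ in sign, logarithm base, or normalization.

```python
import math


def _normalized_spectrum(singular_values):
    """Scale non-negative singular values so that they sum to one."""
    values = [float(s) for s in singular_values]
    if not values or any(s < 0 for s in values):
        raise ValueError("singular values must be a non-empty, non-negative sequence")
    total = sum(values)
    if total == 0:
        raise ValueError("at least one singular value must be positive")
    return [s / total for s in values]


def singular_value_entropy(singular_values):
    """Assumed SVE: entropy of the normalized singular value spectrum.

    Higher values mean the features spread over more directions.
    """
    probs = _normalized_spectrum(singular_values)
    # Terms with p == 0 contribute nothing to the entropy, so skip them.
    return -sum(p * math.log(p) for p in probs if p > 0)


def largest_singular_value_ratio(singular_values):
    """Assumed LSVR: negative log of the top singular value's share.

    Lower values mean a single direction dominates the feature space.
    """
    probs = _normalized_spectrum(singular_values)
    return -math.log(max(probs))
```

In practice the singular values would come from an SVD of a feature matrix (samples by feature dimensions), for example via `numpy.linalg.svd(features, compute_uv=False)`. The functions above only cover the step from singular values to the two scalar metrics.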
