Bridging Weakly-Supervised Learning and VLM Distillation: Noisy Partial Label Learning for Efficient Downstream Adaptation

Qian-Wei Wang, Yuqiu Xie, Letian Zhang, Zimo Liu, Shu-Tao Xia

TL;DR

The paper addresses instance-dependent noise in labels produced by pre-trained vision-language models when learning downstream tasks under noisy partial labeling. It proposes Co-Reg, a collaborative consistency regularization framework with two networks performing co-pseudo-labeling, self-training, prototypical similarity alignment, and noisy contrastive learning to robustly recover ground-truth distributions from VLM annotations. Across six datasets and multiple VLM backbones, Co-Reg consistently outperforms state-of-the-art NPLL and KD baselines, including in semi-supervised settings with a few manually labeled examples, demonstrating annotation-free yet effective downstream adaptation. By uniting weakly-supervised learning with distillation-style knowledge transfer, the approach offers practical, scalable improvements for leveraging large vision-language models in real-world tasks without extensive manual labeling.

Abstract

In noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators. With the emergence of high-performance pre-trained vision-language models (VLMs) such as CLIP, LLaVA and GPT-4V, using these models to replace time-consuming manual annotation workflows and achieve "manual-annotation-free" training for downstream tasks has become a promising research direction. This paper focuses on learning from noisy partial labels annotated by pre-trained VLMs and proposes a collaborative consistency regularization (Co-Reg) method. Unlike the symmetric noise primarily addressed in traditional noisy label learning, the noise generated by pre-trained models is instance-dependent: it reflects the underlying patterns of the pre-trained models themselves, which makes learning significantly harder. To address this, we simultaneously train two neural networks that collaboratively purify the training labels through a "Co-Pseudo-Labeling" mechanism, while enforcing consistency regularization in both the label space and the feature representation space. Specifically, we construct multiple anti-overfitting mechanisms that mine latent information from noisy partially labeled samples, including alternating optimization of contrastive feature representations and pseudo-labels, and maintaining prototypical class vectors in the shared feature space.

Paper Structure

This paper contains 38 sections, 1 theorem, 22 equations, 5 figures, 9 tables.

Key Result

Proposition 1

Given the Bregman divergence $d(a, b) = d_{\phi}(a, b) = \|a - b\|^2$, the prototype of each class $j$ that minimizes the objective in Eq. prototype_2 is unique and is given by $o^*_j = \mathbb{E}_{z \sim p(z|j)}[z]$.
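
The following is a short derivation sketch of why the class mean is the unique minimizer, not the paper's own proof. It assumes that Eq. prototype_2, which is not reproduced on this page, has the form $\min_{o_j} \mathbb{E}_{z \sim p(z|j)}[d(z, o_j)]$; the symbol $\mu_j$ is introduced only for this sketch, and all expectations are taken over $z \sim p(z|j)$.

$$
\begin{aligned}
\mu_j &:= \mathbb{E}[z], \\
\mathbb{E}\,\|z - o_j\|^2
  &= \mathbb{E}\,\|z - \mu_j\|^2
   + 2\,\mathbb{E}[z - \mu_j]^{\top}(\mu_j - o_j)
   + \|\mu_j - o_j\|^2 \\
  &= \mathbb{E}\,\|z - \mu_j\|^2 + \|\mu_j - o_j\|^2 .
\end{aligned}
$$

The cross term vanishes because $\mathbb{E}[z - \mu_j] = 0$. The first remaining term does not depend on $o_j$, and the second is zero only at $o_j = \mu_j$, so under the assumed objective the minimizer is unique and equals $\mathbb{E}_{z \sim p(z|j)}[z]$.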

Figures (5)

  • Figure 1: Schematic diagram of using CLIP and multiple prompt templates to annotate images from downstream tasks with noisy partial labels (candidate label sets). In this process, each prompt template is combined with all class names of the task to form text inputs, which are then encoded by the text encoder to obtain text embeddings. These text embeddings are matched with image embeddings (derived from the image encoder) to generate CLIP's predicted class distribution for each image.
  • Figure 2: Schematic diagram of the Co-Pseudo-Labeling step in our method (taking the example of using the knowledge of Net1 to assist the training of Net2). We use Net1 to divide the training set into a "Partial Set" and an "Unlabeled Set" based on the credibility of the partial labels annotated by the pre-trained model. For samples in the Partial Set where the partial labels are considered trustworthy, we only retain their prediction probabilities on the candidate labels. Then, the prediction probabilities of Net1 and the prediction probabilities of Net2 itself are fused and provided to Net2 for training in the next epoch.
  • Figure 3: Schematic diagram of the self-training and feature representation optimization in our method. We use the pseudo-labels assigned by the Co-Pseudo-Labeling to perform consistency regularized training on strongly-augmented samples. Meanwhile, we use weakly-augmented samples to maintain a prototype vector for the projected feature representation of each category (shown in bold color) in a shared representation space between both networks, and enforce that the similarity distribution between the projected representations of strongly-augmented samples and the prototype vectors aligns with the predicted class distribution of these samples. Additionally, we maintain a momentum-updated network for each neural network to iteratively optimize the model's representation ability and pseudo-labels via noisy supervised contrastive learning.
  • Figure 4: Sensitivity analysis of key hyper-parameters in our proposed method. Each subfigure illustrates the performance variation with respect to specific hyper-parameters: $\lambda_u$ (weight coefficient of unlabeled set loss), $d'$ (feature dimension of projected representations for prototypical similarity alignment and noisy contrastive learning), $\tau_{\text{div}}$ (threshold for dividing partial and unlabeled set), and $T$ (sharpening temperature parameter in co-pseudo-labeling).
  • Figure 5: Accuracy changes of our algorithm when using limited sample ratios (Zero-Shot, 20%, 40%, 60%, 80%, and 100%) on the CIFAR-10, CIFAR-100, SVHN, and EuroSAT datasets. The red asterisk denotes Zero-Shot performance, while the blue line shows results for increasing sample ratios (20–100%). Each subfigure corresponds to a model (CLIP ViT-B/32 or LLaVA-1.5) and dataset, illustrating accuracy improvements with growing labeled data.
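
The annotation step of Figure 1 and the Co-Pseudo-Labeling step of Figure 2 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The captions do not give the exact rules, so the following are assumptions made for illustration: the candidate set is the union of each prompt template's top-1 class, the credibility of a partial label is the probability mass Net1 places on the candidate set, the two networks' predictions are fused by a plain average, and sharpening raises probabilities to the power $1/T$. The function names and default values are hypothetical; only $\tau_{\text{div}}$ and $T$ correspond to hyper-parameters named in Figure 4.

```python
import numpy as np


def candidate_label_sets(template_probs):
    """Build candidate label sets from per-template class distributions.

    template_probs: array of shape (n_templates, n_samples, n_classes).
    Returns a boolean mask of shape (n_samples, n_classes).
    Assumption: a class is a candidate if any template ranks it top-1.
    """
    n_templates, n_samples, n_classes = template_probs.shape
    top1 = template_probs.argmax(axis=2)  # shape (n_templates, n_samples)
    mask = np.zeros((n_samples, n_classes), dtype=bool)
    for t in range(n_templates):
        mask[np.arange(n_samples), top1[t]] = True
    return mask


def co_pseudo_label(p_net1, p_net2, cand_mask, tau_div=0.5, T=0.5):
    """Sketch of Net1 assisting Net2 (the direction shown in Figure 2).

    p_net1, p_net2: softmax outputs of shape (n_samples, n_classes).
    Returns (targets, is_partial): training targets for Net2 and the
    Partial/Unlabeled split decided by Net1.
    """
    # Assumed credibility score: Net1's probability mass on the candidates.
    credibility = (p_net1 * cand_mask).sum(axis=1)
    is_partial = credibility >= tau_div

    # Assumed fusion rule: plain average of the two networks' predictions.
    fused = 0.5 * (p_net1 + p_net2)

    # Partial Set: keep only candidate labels. Unlabeled Set: keep all classes.
    keep = np.where(is_partial[:, None], cand_mask, True)
    fused = fused * keep

    # Assumed sharpening: temperature T < 1 makes the distribution peakier.
    sharpened = fused ** (1.0 / T)
    targets = sharpened / sharpened.sum(axis=1, keepdims=True)
    return targets, is_partial


# Toy example: 2 prompt templates, 2 images, 3 classes.
template_probs = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],  # template 0
    [[0.4, 0.5, 0.1], [0.2, 0.2, 0.6]],  # template 1
])
mask = candidate_label_sets(template_probs)

p_net1 = np.array([[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]])
p_net2 = np.array([[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]])
targets, is_partial = co_pseudo_label(p_net1, p_net2, mask)
```

In the toy example, Net1 trusts the candidate set of the first image and distrusts that of the second, so the first target is confined to its candidate labels while the second keeps mass on every class. In the full method the roles of the two networks would also be swapped so that Net2 assists Net1.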

Theorems & Definitions (2)

  • Proposition 1
  • Proof 1