Meta Co-Training: Two Views are Better than One

Jay C. Rothenberger, Dimitrios I. Diochnos

TL;DR

Meta Co-Training (MCT) addresses the challenge of semi-supervised learning when true multi-view data is unavailable by constructing two complementary views from pre-trained representations and optimizing a bi-level student-teacher objective. The method jointly leverages pseudo-labels across views while using labeled data to supervise and refine the teacher, resulting in robust improvements over traditional co-training, especially when the information content of the views differs significantly. Empirical results on ImageNet-10% establish new state-of-the-art performance, with additional gains on the Flowers102, Food101, FGVC Aircraft, and iNaturalist datasets, demonstrating MCT's robustness to view imbalance and its advantage over deep ensembles in many settings. The findings highlight the practical value of using pre-trained foundation-model embeddings as interchangeable views for semi-supervised learning without extensive retraining.

Abstract

In many critical computer vision scenarios unlabeled data is plentiful, but labels are scarce and difficult to obtain. As a result, semi-supervised learning, which leverages unlabeled data to boost the performance of supervised classifiers, has received significant attention in recent literature. One representative class of semi-supervised algorithms is co-training algorithms. Co-training algorithms leverage two different models which have access to different independent and sufficient representations or "views" of the data to jointly make better predictions. Each of these models creates pseudo-labels on unlabeled points which are used to improve the other model. We show that in the common case where independent views are not available, we can construct such views inexpensively using pre-trained models. Co-training on the constructed views yields a performance improvement over any of the individual views we construct and performance comparable with recent approaches in semi-supervised learning. We present Meta Co-Training, a novel semi-supervised learning algorithm, which has two advantages over co-training: (i) learning is more robust when there is a large discrepancy between the information content of the different views, and (ii) it does not require retraining from scratch on each iteration. Our method achieves new state-of-the-art performance on ImageNet-10%, achieving a ~4.7% reduction in error rate over prior work. Our method also outperforms prior semi-supervised work on several other fine-grained image classification datasets.
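
The pseudo-label exchange described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifier, the confidence threshold, and all names (`fit_centroids`, `predict`, `co_train`) are assumptions chosen to keep the example dependency-free. In the paper's setting the two views would be embeddings from two pre-trained models rather than toy feature lists.

```python
import math

def fit_centroids(X, y):
    """Toy 'model' (an assumption): one mean feature vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict(centroids, x):
    """Return (label, confidence); confidence is a softmax over
    negative squared distances to the class centroids."""
    labels = sorted(centroids)
    scores = [-sum((a - b) ** 2 for a, b in zip(x, centroids[c])) for c in labels]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    best = max(range(len(labels)), key=lambda i: exps[i])
    return labels[best], exps[best] / sum(exps)

def co_train(lab1, lab2, y, unl1, unl2, rounds=3, threshold=0.9):
    """Co-training sketch: each model pseudo-labels unlabeled points from
    its own view, and those labels extend the OTHER model's training set."""
    # train[k] holds the (features, labels) used to fit the model on view k.
    train = [(list(lab1), list(y)), (list(lab2), list(y))]
    unl = (unl1, unl2)
    remaining = list(range(len(unl1)))
    for _ in range(rounds):
        # Classic co-training refits both models from scratch every round.
        models = [fit_centroids(X, labels) for X, labels in train]
        still_unlabeled = []
        for i in remaining:
            used = False
            for k in (0, 1):
                label, conf = predict(models[k], unl[k][i])
                if conf >= threshold:
                    other = 1 - k
                    train[other][0].append(unl[other][i])
                    train[other][1].append(label)
                    used = True
            if not used:
                still_unlabeled.append(i)
        remaining = still_unlabeled
    # Final refit on labeled data plus all accepted pseudo-labels.
    return [fit_centroids(X, labels) for X, labels in train]
```

The refit at the top of every round is the "retraining from scratch on each iteration" that the abstract lists as a cost of co-training.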

Paper Structure

This paper contains 31 sections, 10 equations, 16 figures, 11 tables, and 1 algorithm.

Figures (16)

  • Figure 1: At each step $t\in\{1, \ldots, T\}$ of meta co-training, the models defined by the parameters learned so far, $\theta_{1; t}$ and $\theta_{2; t}$, each act as both student and teacher, using batches from their respective views. Pseudo-labeling occurs on complementary views, so that the teacher can label an unlabeled batch for the student. Labeled batches may or may not use complementary views: their only purpose is to compute the student model's risk on the labeled batch, and that risk is the signal by which the teacher model updates its weights.
  • Figure 2: Top-1 accuracy of co-training (CT) iterations on the CLIP and DINOv2 views for the ImageNet 10% dataset.
  • Figure 3: Meta co-training (MCT) using the CLIP and DINOv2 views as a function of the training step. Models are trained on 10% of the ImageNet labels.
  • Figure 4: The top-1 accuracy of the CT predictions for each iteration of CT. 10% of available Food101 labels are used for training. The method exhibits performance improvement over multiple iterations of pseudo-labeling and retraining.
  • Figure 5: Top-1 accuracy of CT iterations on the CLIP and DINOv2 views for the ImageNet 1% dataset.
  • ...and 11 more figures
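
The step described in the Figure 1 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the linear softmax models, the hard pseudo-labels, and the finite-difference feedback (the drop in the student's labeled loss, a common approximation in Meta Pseudo Labels-style code) are all choices made here, and the names `mct_step`, `ce_grad`, and `hard_labels` are hypothetical. The teacher's direct supervised loss on labeled data is omitted for brevity.

```python
import math

def probs(W, x):
    """Softmax class probabilities of a linear model W (classes x features)."""
    z = [sum(w * v for w, v in zip(row, x)) for row in W]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ce_loss(W, X, y):
    """Mean cross-entropy of model W on batch (X, y)."""
    return -sum(math.log(probs(W, x)[t]) for x, t in zip(X, y)) / len(X)

def ce_grad(W, X, y):
    """Gradient of the mean cross-entropy with respect to W."""
    G = [[0.0] * len(W[0]) for _ in W]
    for x, t in zip(X, y):
        p = probs(W, x)
        for c in range(len(W)):
            coeff = (p[c] - (1.0 if c == t else 0.0)) / len(X)
            for j, v in enumerate(x):
                G[c][j] += coeff * v
    return G

def sgd(W, G, lr, scale=1.0):
    """One gradient step: W - lr * scale * G."""
    return [[w - lr * scale * g for w, g in zip(rw, rg)] for rw, rg in zip(W, G)]

def hard_labels(W, X):
    """Most probable class for each example in X."""
    return [max(range(len(W)), key=probs(W, x).__getitem__) for x in X]

def mct_step(W1, W2, unl1, unl2, lab1, lab2, y, lr=0.1):
    """One simplified step in which both models are student and teacher."""
    # Teacher role: each model pseudo-labels the unlabeled batch from its view.
    pl1, pl2 = hard_labels(W1, unl1), hard_labels(W2, unl2)
    old1, old2 = ce_loss(W1, lab1, y), ce_loss(W2, lab2, y)
    # Student role: each model steps on the OTHER model's pseudo-labels,
    # evaluated on its own view of the same unlabeled points.
    S1 = sgd(W1, ce_grad(W1, unl1, pl2), lr)
    S2 = sgd(W2, ce_grad(W2, unl2, pl1), lr)
    # Feedback for each teacher: how much its student's labeled loss dropped.
    h1 = old2 - ce_loss(S2, lab2, y)
    h2 = old1 - ce_loss(S1, lab1, y)
    # Teacher update: reinforce own pseudo-labels in proportion to feedback
    # (assumption: applied on top of the student update of the same model).
    N1 = sgd(S1, ce_grad(W1, unl1, pl1), lr, scale=h1)
    N2 = sgd(S2, ce_grad(W2, unl2, pl2), lr, scale=h2)
    return N1, N2
```

Unlike the co-training loop, nothing is refit from scratch here: each call nudges the existing parameters, which matches advantage (ii) claimed in the abstract. When the feedback is negative, that is, when the student's labeled loss rose, the teacher is pushed away from the pseudo-labels it just issued.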