Data Distillation: Towards Omni-Supervised Learning

Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, Kaiming He

TL;DR

The paper tackles the challenge of leveraging unlimited unlabeled data alongside labeled datasets to improve visual recognition, introducing omni-supervised learning and a simple, scalable data distillation pipeline. Data distillation generates hard training labels for unlabeled data by ensembling a single model's predictions over multiple geometric transformations of each image, then retrains a student on the combined labeled and distilled data without altering the model architecture or losses. Experiments on COCO keypoint and object detection show consistent gains over strong fully supervised baselines in both small- and large-scale settings, including cases with a distribution shift between the labeled and unlabeled data. The results demonstrate that carefully designed self-training with multi-transform inference can exploit internet-scale unlabeled data to surpass state-of-the-art supervised methods on real-world vision tasks.

Abstract

We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Omni-supervised learning is lower-bounded by performance on existing labeled datasets, offering the potential to surpass state-of-the-art fully supervised methods. To exploit the omni-supervised setting, we propose data distillation, a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations. We argue that visual recognition models have recently become accurate enough that it is now possible to apply classic ideas about self-training to challenging real-world data. Our experimental results show that in the cases of human keypoint detection and general object detection, state-of-the-art models trained with data distillation surpass the performance of using labeled data from the COCO dataset alone.
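
The multi-transform ensembling described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy setting where a hypothetical `model` maps a 2D image to a 2D keypoint heatmap, uses horizontal flipping as the only non-trivial geometric transformation, and introduces a hypothetical `score_threshold` to decide whether the ensembled prediction is confident enough to keep as a hard label.

```python
import numpy as np

def identity(x):
    """No-op transform; it is its own inverse."""
    return x

def hflip(x):
    """Horizontally flip an (H, W) array; it is its own inverse."""
    return x[:, ::-1]

# (forward, inverse) transform pairs. This minimal set is an assumption
# made for illustration, not the set used in the paper.
TRANSFORMS = [(identity, identity), (hflip, hflip)]

def distill_label(model, image, score_threshold=0.5):
    """Ensemble one model over several transforms and return a hard label.

    `model` maps an (H, W) image to an (H, W) heatmap of scores.
    Returns the (row, col) of the ensembled peak, or None when the peak
    score falls below `score_threshold` (a hypothetical parameter).
    """
    # Run the same model on each transformed image, then map every
    # prediction back to the original coordinate frame.
    heatmaps = [inverse(model(forward(image))) for forward, inverse in TRANSFORMS]
    # Average the aligned predictions to form the ensemble.
    ensembled = np.mean(heatmaps, axis=0)
    peak = np.unravel_index(np.argmax(ensembled), ensembled.shape)
    if ensembled[peak] < score_threshold:
        return None  # too uncertain to use as an automatic annotation
    return int(peak[0]), int(peak[1])

def toy_model(x):
    """Hypothetical imperfect model: echoes the input and adds a spurious
    response at a fixed location that does not move with the image."""
    heatmap = x.copy()
    heatmap[0, 0] += 0.8
    return heatmap

image = np.zeros((4, 6))
image[1, 4] = 0.6  # the true keypoint location in this toy example
label = distill_label(toy_model, image)
```

In this toy example a single forward pass would pick the spurious response at (0, 0), whereas the flipped pass places that error elsewhere once mapped back, so averaging favors the location on which both passes agree. The resulting labels would then be added to the labeled set when retraining the student.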

Paper Structure

This paper contains 33 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Model Distillation [Hinton2015] vs. Data Distillation. In data distillation, ensembled predictions from a single model applied to multiple transformations of an unlabeled image are used as automatically annotated data for training a student model.
  • Figure 2: Ensembling keypoint predictions from multiple data transformations can yield a single superior (automatic) annotation. For visualization purposes all images and keypoint predictions are transformed back to their original coordinate frame.
  • Figure 3: Random examples of annotations generated on static Sports-1M [Karpathy2014] frames using a ResNet-50 teacher. The generated annotations have reasonably high quality, though as expected there are mistakes like inverted keypoints (top right).
  • Figure 4: Selected results of fully-supervised learning in the original co-115 set (top) vs. data distillation in the co-115 plus s1m-180 sets (bottom). The results are on the held-out data from COCO test-dev.
  • Figure 5: Data distillation applied to co-115 with labels and different fractions of un-120 images without labels, comparing with the co-115 fully-supervised baseline, using ResNet-50.
  • ...and 2 more figures