From Obstacles to Resources: Semi-supervised Learning Faces Synthetic Data Contamination

Zerun Wang, Jiafeng Mao, Liuyu Xiang, Toshihiko Yamasaki

TL;DR

This work addresses the problem of semi-supervised learning when unlabeled data are contaminated with synthetic images produced by diffusion models. It introduces Real-Synthetic Hybrid SSL (RS-SSL) and a benchmarking setup that reveals current SSL methods struggle or even degrade in the presence of synthetic data. To tackle this, the authors propose RSMatch, which first identifies synthetic data via a lightweight detector and a class-wise queue, then uses a dummy head to exploit synthetic samples for SSL while preserving the original classifier behavior. Experiments across CIFAR-10/100, TinyImageNet, and ImageNet show RSMatch yields consistent gains from synthetic unlabeled data and demonstrates practical benefits for learning from publicly sourced unlabeled data.

Abstract

Semi-supervised learning (SSL) can improve model performance by leveraging unlabeled images, which can be collected from public image sources at low cost. In recent years, synthetic images have become increasingly common in public image sources due to rapid advances in generative models. It is therefore becoming inevitable that unlabeled data for SSL will include existing synthetic images. How this kind of contamination affects SSL remains unexplored. In this paper, we introduce a new task, Real-Synthetic Hybrid SSL (RS-SSL), to investigate the impact of unlabeled data contaminated by synthetic images on SSL. First, we set up a new RS-SSL benchmark to evaluate current SSL methods and find that they struggle to benefit from unlabeled synthetic images and are sometimes even negatively affected. To address this, we propose RSMatch, a novel SSL method specifically designed to handle the challenges of RS-SSL. RSMatch identifies unlabeled synthetic data and further utilizes them for improvement. Extensive experimental results show that RSMatch can turn synthetic unlabeled data from "obstacles" into "resources." The effectiveness is further verified through ablation studies and visualization.

Paper Structure

This paper contains 17 sections, 7 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The task setting of Real-Synthetic Hybrid SSL (RS-SSL) compared with previous SSL tasks. The unlabeled dataset for SSL is contaminated by existing synthetic images uploaded to public image sources. Note that there is no annotation for real or synthetic in unlabeled data.
  • Figure 2: Examples of generation results for the classes 'airplane' and 'dog' in CIFAR-10 (Krizhevsky, 2009) with the SD14 model (Rombach et al., 2022). The prompts under the images are automatically generated by the T5 model (Raffel et al., 2020). Note that besides images matching the target class, there are also some images with semantic bias and even some entirely unrelated images. This brings our RS-SSL benchmark closer to the practical scenario.
  • Figure 3: t-SNE (van der Maaten and Hinton, 2008) visualization of real and synthetic unlabeled samples, showing the distribution bias. The samples are from three random classes in our benchmark with the CIFAR-10 dataset and an additional 50% synthetic images in the unlabeled data (i.e., $\alpha=0.5$). The model is trained with FixMatch (Sohn et al., 2020).
  • Figure 4: The RSMatch framework. Left: A lightweight deepfake detector is trained to identify real and synthetic images in the unlabeled data batch. The labeled synthetic data used for its supervision come from the CSQueue, which is proposed for mining and storing synthetic data from the unlabeled data. Right: The classifier, with a real head and a dummy head, is self-trained on the identified real and synthetic unlabeled data separately. The two networks are trained simultaneously.
  • Figure 5: Updating strategy of the CSQueue. The unlabeled data are pseudo-labeled by the classifier and then sorted by the confidence score from the deepfake detector. In each iteration, we select $P$ classes and push the top-$Q$ images of each selected class into its sub-queue.
  • ...and 2 more figures
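
The CSQueue update described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `ClassWiseQueue`, the bounded FIFO sub-queues, and all parameter names are hypothetical. The caption does not say how the $P$ classes are chosen, so the sketch assumes the classes with the emptiest sub-queues are preferred, and it assumes a higher detector score means "more likely synthetic".

```python
from collections import defaultdict, deque


class ClassWiseQueue:
    """Hypothetical class-wise queue: one bounded FIFO sub-queue per class."""

    def __init__(self, num_classes, capacity):
        # deque(maxlen=...) silently drops the oldest item once a sub-queue is full.
        self.queues = [deque(maxlen=capacity) for _ in range(num_classes)]

    def update(self, samples, pseudo_labels, synth_scores, num_select_classes, top_q):
        # Group batch indices by the classifier's pseudo-label.
        by_class = defaultdict(list)
        for idx, label in enumerate(pseudo_labels):
            by_class[label].append(idx)

        # ASSUMPTION (not stated in the source): pick the P classes whose
        # sub-queues currently hold the fewest items, ties broken by class index.
        candidates = sorted(by_class, key=lambda c: (len(self.queues[c]), c))
        chosen = candidates[:num_select_classes]

        for c in chosen:
            # Rank this class's samples by detector score, highest first
            # (assumed to mean "most confidently synthetic").
            ranked = sorted(by_class[c], key=lambda i: synth_scores[i], reverse=True)
            for i in ranked[:top_q]:
                self.queues[c].append(samples[i])
        return chosen


# Toy batch: six unlabeled samples, three classes, P = 2, Q = 2.
queue = ClassWiseQueue(num_classes=3, capacity=4)
samples = ["a", "b", "c", "d", "e", "f"]
pseudo_labels = [0, 0, 0, 1, 1, 2]
synth_scores = [0.2, 0.9, 0.7, 0.4, 0.8, 0.95]
chosen = queue.update(samples, pseudo_labels, synth_scores,
                      num_select_classes=2, top_q=2)
```

In this toy batch all sub-queues start empty, so classes 0 and 1 are selected and each receives its two highest-scoring samples. On a later call, class 2 would be preferred because its sub-queue is still empty.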