Dealing with Synthetic Data Contamination in Online Continual Learning

Maorong Wang, Nicolas Michel, Jiafeng Mao, Toshihiko Yamasaki

TL;DR

Entropy Selection with Real-synthetic similarity Maximization (ESRM) is a method for training online CL models on datasets contaminated by synthetic images. Experiments show that it significantly alleviates the performance deterioration such images cause.

Abstract

Image generation has shown remarkable results in producing high-fidelity, realistic images, in particular with the advancement of diffusion-based models. However, the prevalence of AI-generated images may have side effects for the machine learning community that are not yet clearly identified. Meanwhile, the success of deep learning in computer vision is driven by massive datasets collected from the Internet. The large quantity of synthetic data being added to the Internet may become an obstacle for future researchers trying to collect "clean" datasets without AI-generated content. Prior research has shown that training on datasets contaminated by synthetic images may result in performance degradation. In this paper, we investigate the potential impact of contaminated datasets on Online Continual Learning (CL) research. We experimentally show that contaminated datasets might hinder the training of existing online CL methods. We also propose Entropy Selection with Real-synthetic similarity Maximization (ESRM), a method to alleviate the performance deterioration caused by synthetic images when training online CL models. Experiments show that our method can significantly alleviate performance deterioration, especially when the contamination is severe. For reproducibility, the source code of our work is available at https://github.com/maorong-wang/ESRM.

Paper Structure

This paper contains 38 sections, 5 equations, 11 figures, 13 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of the proposed ESRM framework for online CL. The framework has two main components: Entropy Selection (ES) and Real-synthetic similarity Maximization (RM). Motivated by Obs. \ref{obs:4} and Obs. \ref{obs:2}, ES is a buffer management strategy that uses entropy as a criterion to select more real samples for the memory buffer, thereby alleviating the catastrophic forgetting and performance degradation caused by the contamination. RM uses a contrastive learning technique to bridge the embedding gap between synthetic and real data noted in Obs. \ref{obs:3}.
  • Figure 2: The entropy distribution of the training dataset produced by ER and OnPro on In-100/SDXL ($P=50\%$) at the end of the training.
  • Figure 3: t-SNE visualization of the memory data at the end of training on In-100/SDXL ($P=50\%$). For clarity, only the first 10 classes are visualized.
  • Figure 4: Overview of the proposed Entropy Selection strategy. The color of the samples indicates the class, and the number in the samples represents the entropy predicted by the learner.
  • Figure 5: t-SNE visualization of memory data produced by ESRM at the end of training on the In-100/SDXL ($P=50\%$) dataset. For clarity, only the first 10 classes are visualized.
  • ...and 6 more figures
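
The captions for Figures 1, 2 and 4 refer to the entropy the learner predicts for each sample, which ES uses as its buffer selection criterion. As a minimal sketch of that quantity only, the Python below computes the Shannon entropy of a softmax prediction from raw logits. The function names `softmax` and `prediction_entropy` and the example logits are illustrative assumptions, not taken from the ESRM code, and the selection rule itself is not specified in this summary, so it is not shown.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over classes.

    Hypothetical helper: the summary does not state how ESRM computes
    or thresholds entropy, only that entropy is the selection criterion.
    """
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Illustrative logits (assumed values, not from the paper).
uniform = prediction_entropy([0.0, 0.0, 0.0, 0.0])     # maximally uncertain
confident = prediction_entropy([10.0, 0.0, 0.0, 0.0])  # one dominant class
```

A uniform prediction over four classes gives the maximum entropy, log 4, while a prediction dominated by one class gives a value close to zero. A buffer strategy can rank incoming samples by this score.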