Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data

Abdulrahman Kerim, Leandro Soriano Marcolino, Erickson R. Nascimento, Richard Jiang

TL;DR

The paper tackles the challenge of using synthetic data to close the real-vs-synthetic domain gap in supervised learning. It introduces a dynamic usability framework built on a two-component metric (DPS and FCS) and an adaptive UCB-based training loop that selects the most informative synthetic samples at each epoch. The methodology combines LLM-guided attribute extraction with Stable Diffusion to generate diverse, high-quality data, and it evaluates usability with a score $U$ defined by $U = \Psi\Phi$, where $\Phi$ depends on the KL divergence $D_{KL}$ between real-class and synthetic features. Empirical results show improvements of up to 10% in classification accuracy across multiple architectures and tasks, supporting the approach and its potential for scalable, data-efficient learning with synthetic data.

Abstract

Supervised machine learning methods require large-scale training datasets to perform well in practice. Synthetic data has shown great progress recently and has been used as a complement to real data. However, there is still a pressing need to assess the usability of synthetically generated data. To this end, we propose a novel UCB-based training procedure combined with a dynamic usability metric. Our proposed metric integrates low-level and high-level information from synthetic images and their corresponding real and synthetic datasets, surpassing existing traditional metrics. The UCB-based dynamic approach ensures continual enhancement of model learning. Unlike other approaches, our method effectively adapts to changes in the machine learning model's state and considers the evolving utility of training samples during the training process. We show that our metric is an effective way to rank synthetic images based on their usability. Furthermore, we propose a new attribute-aware bandit pipeline for generating synthetic data by integrating a Large Language Model with Stable Diffusion. Quantitative results show that our approach can boost the performance of a wide range of supervised classifiers. Notably, we observed an improvement of up to 10% in classification accuracy compared to traditional approaches, demonstrating the effectiveness of our approach. Our source code, datasets, and additional materials are publicly available at https://github.com/A-Kerim/Synthetic-Data-Usability-2024.
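
For readers unfamiliar with the bandit idea behind a UCB-based selection loop, the following is a minimal sketch of the textbook UCB1 rule, not the authors' training procedure. It assumes each "arm" is a hypothetical group of synthetic samples and that some reward signal (for example, a change in validation accuracy after training on that group) is available; the names `ucb1_select`, `run_bandit`, and `reward_fn` are illustrative and do not come from the paper or its repository.

```python
import math
import random


def ucb1_select(counts, means, t, c=math.sqrt(2)):
    """Return the index of the arm with the highest UCB1 score.

    counts[i]: how many times arm i has been chosen so far.
    means[i]:  running average reward of arm i.
    t:         current round number (1-based).
    c:         exploration weight; sqrt(2) gives the classic UCB1 bonus.
    """
    # Try every arm once before trusting any average.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [m + c * math.sqrt(math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(reward_fn, n_arms, n_rounds):
    """Run UCB1 for n_rounds; reward_fn(arm) is a hypothetical reward signal."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, n_rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = reward_fn(arm)
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


if __name__ == "__main__":
    # Assumption: three groups of synthetic samples with made-up usefulness.
    random.seed(0)
    true_usefulness = [0.1, 0.5, 0.9]
    counts, means = run_bandit(
        lambda arm: 1.0 if random.random() < true_usefulness[arm] else 0.0,
        n_arms=3,
        n_rounds=500,
    )
    print("pulls per group:", counts)
```

The exploration bonus shrinks as a group is chosen more often, so the loop keeps revisiting rarely used groups while concentrating on those whose observed reward stays high. In a training setting the reward would change as the model evolves, which is the kind of non-stationarity the abstract's "dynamic" metric is meant to address.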

Paper Structure

This paper contains 16 sections, 7 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Calculating Usability Score $\boldsymbol{U}$ for Synthetic Images. Each synthetic image $I_i$ is assigned a usability score $\boldsymbol{U}$, which is derived from pixel-level and high-level information.
  • Figure 2: Samples from the synthetically generated datasets using our generation pipeline. Photorealistic and artistic samples are shown in the first and second columns of each dataset, respectively.
  • Figure 3: Top usable synthetic images on three synthetic datasets selected based on various metrics. Traditional metrics struggle to consistently identify diverse and photorealistic images. In contrast, our approach (last row) effectively filters and highlights the most usable synthetic images. Best viewed in color and with zoom.
  • Figure 4: Comparison of common examples identified by our metric (i.e., Mean(DPS, FCS)) versus other metrics across three synthetic datasets: SP-Car-2, SP-CIFAR-10, and SP-Birds-525.