Does Training on Synthetic Data Make Models Less Robust?

Lingze Zhang, Ellie Pavlick

TL;DR

This study investigates whether training on synthetic data produced by the same or similar LLMs exacerbates blindspots in NLP models, using natural language inference with MultiNLI and the HANS adversarial set as a probe. The authors implement a two-model framework (task model and generator) to produce synthetic datasets and evaluate how fine-tuning on these data affects general NLI performance and sensitivity to heuristic-driven blindspots. Across multiple starting-point models and synthetic-data sizes, synthetic data improves general NLI performance comparably to original data for undertrained models, but neither consistently worsens nor consistently improves blindspot performance on HANS. A biased synthetic dataset, however, can substantially degrade blindspot performance, underscoring the dangers of unfiltered synthetic data and the need for careful data curation and broader validation beyond case studies. The findings suggest that synthetic data can be a viable tool for scaling data while maintaining robustness, but that it requires nuanced, task-specific evaluation to avoid reinforcing undesirable heuristics.

Abstract

An increasingly common practice is to train large language models (LLMs) using synthetic data. Often this synthetic data is produced by the same or similar LLMs as those it is being used to train. This raises the question of whether the synthetic data might in fact exacerbate certain "blindspots" by reinforcing heuristics that the LLM already encodes. In this paper, we conduct simulated experiments on the natural language inference (NLI) task with Llama-2-7B-hf models. We use MultiNLI as the general task and HANS, a targeted evaluation set designed to measure the presence of specific heuristic strategies for NLI, as our "blindspot" task. Our goal is to determine whether performance disparities between the general and blindspot tasks emerge. Our results indicate that synthetic data does not reinforce blindspots in the way we expected. Specifically, we see that, while fine-tuning with synthetic data doesn't necessarily reduce the use of the heuristic, it also does not make it worse, contrary to our hypothesis.
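
As a rough illustration of the comparison described above, the disparity between the general and blindspot tasks could be measured along these lines. This is only a minimal sketch, not the authors' evaluation code: the function names (`accuracy`, `collapse_to_hans`, `robustness_gap`) are hypothetical, and it assumes the common convention of mapping three-way MultiNLI predictions onto HANS's two labels (entailment vs. non-entailment), which the abstract does not confirm.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def collapse_to_hans(label: str) -> str:
    """Map a three-way NLI label onto HANS's two-way label space.

    Assumption: 'neutral' and 'contradiction' both count as 'non-entailment'.
    """
    return "entailment" if label == "entailment" else "non-entailment"


def robustness_gap(
    general_preds: Sequence[str],
    general_golds: Sequence[str],
    blindspot_preds: Sequence[str],
    blindspot_golds: Sequence[str],
) -> float:
    """General-task accuracy minus blindspot-task accuracy.

    A larger positive value means the model does well on the general task
    while still failing on the heuristic-targeted examples.
    """
    general_acc = accuracy(general_preds, general_golds)
    blindspot_acc = accuracy(
        [collapse_to_hans(p) for p in blindspot_preds], blindspot_golds
    )
    return general_acc - blindspot_acc
```

Comparing this gap before and after fine-tuning on synthetic data would be one way to check whether the disparity grows; the paper may use a different metric.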

Paper Structure

This paper contains 14 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Augmented model performance under different settings.