WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

Benjamin Feuer, Chinmay Hegde

TL;DR

The paper addresses the limited open-science landscape of LLM post-training by introducing WildChat-50M, the largest publicly available synthetic chat dataset, generated from 54 data generating models (DGMs), and by proposing Re-Wild, a targeted SFT data mix. Through large-scale analysis of synthetic data quality (SDQ) across diverse DGMs and SFT experiments, the authors show that the source DGM critically shapes downstream performance, that scaling data helps but the benefit depends on SDQ, and that certain SDQ traits (e.g., comprehensiveness and tone) are heritable through post-training. They also find that blending DGMs offers little advantage and that judge biases influence open benchmarks, underscoring the need for standardized evaluation tools such as Evalchemy. Overall, WildChat-50M and Re-Wild provide a practical, open framework for systematic, replicable studies of synthetic data in post-training, with implications for building stronger generalist LLMs.

Abstract

Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.

Paper Structure

This paper contains 17 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Re-Wild outperforms strong baselines, on average, across nine benchmarks. In particular, it exhibits strong performance on generalist chat and instruction following benchmarks. MT Bench scores here are divided by 10, so that the scale is similar to our other evaluations. For the exact numeric scores for all models, please refer to our GitHub repository. Figure best viewed in color.
  • Figure 2: Data scaling improves SFT performance. The effect is, however, somewhat dependent on SDQ -- for DGMs such as GPT 3.5, the benefits taper off relatively quickly, but for the other three DGMs we consider, they continue to increase. Avg is the average performance over (MixEval, AlpacaEval2-LC, MTBench / 10, OpenLLM LB 2).
  • Figure 3: Key words more common in L8B : L70 judgments. The more negative tone of these judgments emphasizes words like clearer (as in, "could have been clearer"), lacks, convoluted and repetitive.
  • Figure 4: Key words more common in L8B : Q72 judgments. These judgments tended to be more positive; emphasis was placed on words like appropriate, necessary, comprehensive and accurate.
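
The Avg value described in the Figure 2 caption can be sketched as a plain unweighted mean. This is a minimal illustration, assuming the three non-MT-Bench scores are already on a 0 to 1 scale and that MT Bench is reported on its usual 1 to 10 scale; the function name `scaling_avg` and the input values are hypothetical placeholders, not code or results from the paper.

```python
from statistics import mean


def scaling_avg(mixeval, alpacaeval2_lc, mt_bench, openllm_lb2):
    """Unweighted mean of the four benchmark scores named in the caption.

    Assumption: every score except MT Bench is already on a 0-1 scale.
    MT Bench is divided by 10 so that it sits on a comparable scale.
    """
    return mean([mixeval, alpacaeval2_lc, mt_bench / 10, openllm_lb2])


# Placeholder inputs for illustration only (not scores from the paper).
example = scaling_avg(mixeval=0.5, alpacaeval2_lc=0.25, mt_bench=7.5, openllm_lb2=0.5)
print(example)
```

Dividing MT Bench by 10 before averaging keeps that one benchmark from dominating the mean simply because of its larger numeric range.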