WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Benjamin Feuer, Chinmay Hegde
TL;DR
The paper addresses the limited open-science landscape of LLM post-training by introducing WILDCHAT-50M, the largest publicly available synthetic chat dataset, with responses generated by 54 data generating models (DGMs), and by proposing RE-WILD, a targeted supervised fine-tuning (SFT) data mix. Through a large-scale analysis of synthetic data quality (SDQ) across diverse DGMs and controlled SFT experiments, the authors show that the choice of source DGM critically shapes downstream performance, that scaling data helps but the benefit depends on SDQ, and that certain SDQ traits (e.g., comprehensiveness and tone) are inherited by the fine-tuned model. They also find that blending data from multiple models offers little advantage and that judge biases influence open benchmarks, underscoring the need for standardized evaluation such as Evalchemy. Overall, WILDCHAT-50M and RE-WILD provide a practical, open framework for systematic, replicable studies of synthetic data in post-training, with implications for building stronger generalist LLMs.
Abstract
Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.
