Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training
Wesley Brewer, Murali Meena Gopalakrishnan, Matthias Maiterth, Aditya Kashi, Jong Youl Choi, Pei Zhang, Stephen Nichols, Riccardo Balin, Miles Couchman, Stephen de Bruyn Kops, P. K. Yeung, Daniel Dotson, Rohini Uma-Vaideswaran, Sarp Oral, Feiyi Wang
TL;DR
This work tackles the data and energy bottlenecks of training spatiotemporal turbulence models by introducing SICKLE, a sparse intelligent curation framework that uses maximum-entropy (MaxEnt) sampling to select informative subsets from extreme-scale DNS data. The method is a two-phase MaxEnt process that first picks representative hypercubes and then selects sampling points within them, with phase-space (UIPS) and temporal sampling serving as baselines. Across 2D and 3D turbulence datasets, SICKLE demonstrates that intelligent subsampling can match or exceed the accuracy of full-data training while reducing energy usage by up to 38x, and that it scales efficiently on Frontier hardware. The work also discusses the tradeoffs between sampling strategies, emphasizes robustness in anisotropic flows, and provides an open-source framework to support reproducibility and future integration into spatiotemporal foundation model pipelines for turbulence.
Abstract
With the end of Moore's law and Dennard scaling, efficient training increasingly requires rethinking data volume. Can we train better models with significantly less data via intelligent subsampling? To explore this, we develop SICKLE, a sparse intelligent curation framework for efficient learning, featuring a novel maximum entropy (MaxEnt) sampling approach, scalable training, and energy benchmarking. We compare MaxEnt with random and phase-space sampling on large direct numerical simulation (DNS) datasets of turbulence. Evaluating SICKLE at scale on Frontier, we show that subsampling as a preprocessing step can, in many cases, improve model accuracy and substantially lower energy consumption, with observed reductions of up to 38x.
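To make the two-phase idea from the TL;DR concrete, the following is a minimal sketch of entropy-guided subsampling, written under stated assumptions rather than taken from SICKLE itself. The function names (`shannon_entropy`, `two_phase_sample`), the histogram-entropy score used to rank hypercubes, and the inverse-bin-count weighting used to pick points are all hypothetical choices made for illustration; the actual MaxEnt criterion in SICKLE may differ.

```python
import itertools
import numpy as np

def shannon_entropy(values, bins=32, value_range=None):
    """Shannon entropy (in nats) of a histogram of `values`."""
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def two_phase_sample(field, cube=8, n_cubes=4, n_points=16, bins=32, seed=0):
    """Hypothetical sketch: pick high-entropy hypercubes, then points in them.

    Returns an integer array of shape (n_cubes * n_points, field.ndim)
    holding the grid coordinates of the selected points.
    """
    rng = np.random.default_rng(seed)
    lo, hi = float(field.min()), float(field.max())
    edges = np.linspace(lo, hi, bins + 1)

    # Phase 1 (assumed criterion): score each non-overlapping hypercube by
    # the entropy of its value histogram and keep the top-scoring ones.
    origins = list(itertools.product(
        *[range(0, s - cube + 1, cube) for s in field.shape]))
    scores = np.array([
        shannon_entropy(field[tuple(slice(o, o + cube) for o in org)],
                        bins, (lo, hi))
        for org in origins
    ])
    chosen = np.argsort(scores)[::-1][:n_cubes]

    samples = []
    for idx in chosen:
        org = origins[idx]
        block = field[tuple(slice(o, o + cube) for o in org)].ravel()
        # Phase 2 (assumed criterion): weight each point by the inverse of
        # its bin count, so rare values are favoured and the sampled value
        # distribution is flatter (higher entropy) than the raw one.
        bin_idx = np.clip(np.digitize(block, edges[1:-1]), 0, bins - 1)
        counts = np.bincount(bin_idx, minlength=bins)
        w = 1.0 / counts[bin_idx]
        w /= w.sum()
        picks = rng.choice(block.size, size=n_points, replace=False, p=w)
        local = np.stack(np.unravel_index(picks, (cube,) * field.ndim), axis=1)
        samples.append(local + np.array(org))
    return np.concatenate(samples, axis=0)

# Usage on a synthetic 3D field (a stand-in for a DNS snapshot).
field = np.random.default_rng(1).normal(size=(16, 16, 16))
points = two_phase_sample(field, cube=8, n_cubes=2, n_points=10)
```

The sketch works for 2D or 3D arrays because the hypercube origins are built per axis. Ranking cubes by a fixed top-k is the simplest option; sampling cubes with probability proportional to their score would be an equally plausible variant.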
