The Data Fusion Labeler (dFL): Challenges and Solutions to Data Harmonization, Labeling, and Provenance in Fusion Energy

Craig Michoski, Matthew Waller, Brian Sammuli, Zeyu Li, Tapan Ganatma Nakkina, Raffi Nazikian, Sterling Smith, David Orozco, Dongyang Kuang, Martin Foltin, Erik Olofsson, Mike Fredrickson, Jerry Louis-Jeune, David R. Hatch, Todd A. Oliver, Mitchell Clark, Steph-Yves Louis

TL;DR

The paper tackles the problem of turning petabyte-scale, heterogeneous fusion-energy data into reliable physics insight. It introduces the Data Fusion Labeler (dFL), a workflow that unifies data harmonization, data fusion, and provenance with uncertainty-aware labeling, enabling rapid, reproducible analysis and cross-device comparability. Through case studies on automated ELM labeling and confinement-regime identification in DIII-D, it demonstrates how principled harmonization and integrated provenance improve label quality and support large-scale, archival labeling and real-time control integration. The work emphasizes operator-order awareness and schema-standardization (IMAS/OMAS) as foundational for scalable, reproducible fusion informatics and highlights the broader relevance of these principles to other data-rich scientific domains.

Abstract

Fusion energy research increasingly depends on the ability to integrate heterogeneous, multimodal datasets from high-resolution diagnostics, control systems, and multiscale simulations. The sheer volume and complexity of these datasets demand the development of new tools capable of systematically harmonizing and extracting knowledge across diverse modalities. The Data Fusion Labeler (dFL) is introduced as a unified workflow instrument that performs uncertainty-aware data harmonization, schema-compliant data fusion, and provenance-rich manual and automated labeling at scale. By embedding alignment, normalization, and labeling within a reproducible, operator-order-aware framework, dFL reduces time-to-analysis by greater than 50X (e.g., enabling >200 shots/hour to be consistently labeled rather than a handful per day), enhances label (and subsequently training) quality, and enables cross-device comparability. Case studies from DIII-D demonstrate its application to automated ELM detection and confinement regime classification, illustrating its potential as a core component of data-driven discovery, model validation, and real-time control in future burning plasma devices.

Paper Structure

This paper contains 57 sections, 22 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Overview of the Data Fusion Labeler (dFL) workflow.
  • Figure 2: The fill capability applied to a signal with NaNs scattered throughout. Top: the NaN-riddled plasma current $i_p$ signal before fill. Bottom: the same signal after dFL fill is applied.
  • Figure 3: Here we show one of the natively supported frequency space graphs, a wavelet transform, with Window Size=256, Overlap=50%, and Morlet Width Param=5, for the plasma current $i_{p}$ (Shot #149058).
  • Figure 4: Here we show a custom graph type incorporated using the python-based Modespyec program for magnetic mode analysis. The toroidal modes are set by contour colors (Shot #149091).
  • Figure 5: The same 100-point data segment shown in Figure 2, first filled, then (top) downsampled to 10 points using importance downsampling that preserves the first two central moments, and (bottom) upsampled to 1000 points using Mono-PCHIP.
  • ...and 10 more figures
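
The fill and Mono-PCHIP upsampling steps named in the captions of Figures 2 and 5 can be illustrated with a minimal sketch. This is not dFL's implementation: the fill rule used here (linear interpolation across NaN gaps), the function names `fill_nans` and `upsample_pchip`, and the synthetic signal are all assumptions made for illustration. The upsampling uses SciPy's `PchipInterpolator`, a monotone piecewise cubic Hermite interpolant.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator


def fill_nans(t, y):
    """Fill NaN samples by linear interpolation between valid neighbours.

    Assumed fill rule for illustration only; dFL's own fill may differ.
    Leading or trailing NaNs would take the nearest valid value.
    """
    y = np.asarray(y, dtype=float)
    valid = ~np.isnan(y)
    return np.interp(t, t[valid], y[valid])


def upsample_pchip(t, y, n_out):
    """Resample (t, y) onto n_out evenly spaced points with monotone PCHIP."""
    t_new = np.linspace(t[0], t[-1], n_out)
    return t_new, PchipInterpolator(t, y)(t_new)


# Synthetic 100-point monotone ramp standing in for a measured signal.
t = np.linspace(0.0, 1.0, 100)
signal = np.tanh(5.0 * (t - 0.3))

# Scatter a few NaNs through the signal, then fill them.
signal_with_gaps = signal.copy()
signal_with_gaps[[5, 6, 40, 77]] = np.nan
filled = fill_nans(t, signal_with_gaps)

# Upsample the filled 100-point segment to 1000 points.
t_up, upsampled = upsample_pchip(t, filled, 1000)
```

PCHIP is a natural choice when overshoot must be avoided: unlike an unconstrained cubic spline, it does not introduce new extrema between samples, so a monotone input stays monotone after upsampling.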