Improving Radiography Machine Learning Workflows via Metadata Management for Training Data Selection

Mirabel Reid, Christine Sweeney, Oleg Korobkin

TL;DR

The paper tackles the challenge of managing ML metadata in the physical sciences by developing a domain-specific tool for dynamic radiography. The tool integrates interactive visualization, visual queries for training-data selection, and centralized metadata tracking via a SQLite backend to support reproducibility. The study demonstrates improved data exploration, more efficient training-data selection, and insights into parameter sensitivity and degeneracy in density-field reconstruction. This approach lets scientists iteratively refine training datasets while preserving the provenance of their decisions, and it could be extended to broader scientific workflows.

Abstract

Most machine learning models require many iterations of hyper-parameter tuning, feature engineering, and debugging to produce effective results. As machine learning models become more complicated, this pipeline becomes more difficult to manage effectively. In the physical sciences, there is an ever-increasing pool of metadata that is generated by the scientific research cycle. Tracking this metadata can reduce redundant work, improve reproducibility, and aid in the feature and training dataset engineering process. In this case study, we present a tool for machine learning metadata management in dynamic radiography. We evaluate the efficacy of this tool against the initial research workflow and discuss extensions to general machine learning pipelines in the physical sciences.
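
As an illustration of the centralized metadata tracking mentioned in the TL;DR, the following is a minimal sketch of how one training-data selection could be recorded in SQLite. It is not the authors' schema: the table name `selections`, its columns, and the helper functions `open_store`, `record_selection`, and `load_selection` are all hypothetical and chosen only to show the idea of preserving the provenance of a selection.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a metadata store. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS selections (
            id           INTEGER PRIMARY KEY AUTOINCREMENT,
            created_at   TEXT DEFAULT CURRENT_TIMESTAMP,
            ground_truth TEXT NOT NULL,     -- name of the reference dataset
            time_step    INTEGER NOT NULL,  -- time step used for the comparison
            filters      TEXT NOT NULL,     -- JSON: parameter ranges applied
            sim_ids      TEXT NOT NULL      -- JSON: selected simulation ids
        )
        """
    )
    return conn


def record_selection(conn, ground_truth, time_step, filters, sim_ids):
    """Store one selection together with the choices that produced it."""
    cur = conn.execute(
        "INSERT INTO selections (ground_truth, time_step, filters, sim_ids) "
        "VALUES (?, ?, ?, ?)",
        (
            ground_truth,
            time_step,
            json.dumps(filters, sort_keys=True),
            json.dumps(sorted(sim_ids)),
        ),
    )
    conn.commit()
    return cur.lastrowid


def load_selection(conn, selection_id):
    """Read a selection back so a training run can be reproduced."""
    row = conn.execute(
        "SELECT ground_truth, time_step, filters, sim_ids "
        "FROM selections WHERE id = ?",
        (selection_id,),
    ).fetchone()
    if row is None:
        raise KeyError(selection_id)
    return {
        "ground_truth": row[0],
        "time_step": row[1],
        "filters": json.loads(row[2]),
        "sim_ids": json.loads(row[3]),
    }
```

A training script could then call `load_selection` with a stored id to rebuild exactly the dataset that was chosen interactively, rather than relying on notes or file names.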

Paper Structure

This paper contains 16 sections, 1 equation, and 9 figures.

Figures (9)

  • Figure 1: A flowchart describing a generic pipeline for learning on simulation data. As the objectives change over time, each step may be repeated and fine-tuned. Ovals indicate the steps of machine learning training. Each step is marked with a cylinder to indicate that metadata generated from that step is tracked in an external store.
  • Figure 2: A simulated shell implosion at four different time steps. The bottom image shows the original radiograph, and the top image shows the extracted density. The shock, which expands outward over time, is labeled on each radiograph with a dotted line. Image courtesy of hossain2022high.
  • Figure 3: The initial page of the training data selection tool GUI. At the top, the "Sample Ground Truth" dataset is selected, and the user is viewing the $\ell_2$ norm between that ground truth dataset and the other available simulation data at time 40. The shock, edge, and density differences for every available simulation, along with the simulation parameter metadata, can be selected in the parallel coordinates plot at the top; here, all values are selected. The scatter plot below lets the user adjust the shock and edge contributions to the comparison.
  • Figure 4: Demonstration of the slider options. In this image, the user has altered the weight of $\delta$edge using a slider and changed the coloring of the scatter plot from 'color by profile' to 'color by s1'. The user can also change the time step at which the comparison to the ground truth was computed using the Time Slider.
  • Figure 5: Demonstration of the parameter selection options. The user has selected the ranges profile=0 and s1=0 on the parallel coordinates plot, which filtered the points on the scatter plot. The user has also selected a subset of points on the scatter plot, which highlighted the corresponding lines on the parallel coordinates plot.
  • ...and 4 more figures
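
The selection workflow described in the captions for Figures 3 to 5 (comparing each simulation with a ground truth, weighting the shock and edge differences, and filtering by parameter ranges) can be sketched as follows. This is a rough illustration under stated assumptions, not the tool's implementation: the weighted-sum combination, the record layout (`id`, `params`, `shock`, `edge`), and the function names `combined_distance` and `select` are hypothetical.

```python
import math


def l2_norm(a, b):
    """Euclidean (l2) distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def combined_distance(sim, truth, w_shock=1.0, w_edge=1.0):
    """Weighted sum of shock and edge differences (an assumed combination)."""
    return (
        w_shock * l2_norm(sim["shock"], truth["shock"])
        + w_edge * l2_norm(sim["edge"], truth["edge"])
    )


def select(sims, truth, ranges, w_shock=1.0, w_edge=1.0, max_distance=math.inf):
    """Return ids of simulations inside the parameter ranges and close to truth.

    `ranges` maps a parameter name to an inclusive (low, high) pair, which
    mimics brushing an axis on a parallel coordinates plot.
    """
    chosen = []
    for sim in sims:
        in_range = all(
            low <= sim["params"][name] <= high
            for name, (low, high) in ranges.items()
        )
        if in_range and combined_distance(sim, truth, w_shock, w_edge) <= max_distance:
            chosen.append(sim["id"])
    return chosen
```

For example, passing `ranges={"profile": (0, 0), "s1": (0, 0)}` corresponds to the brushed ranges described for Figure 5, and lowering `w_edge` plays the role of the $\delta$edge slider described for Figure 4.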