Improving Radiography Machine Learning Workflows via Metadata Management for Training Data Selection
Mirabel Reid, Christine Sweeney, Oleg Korobkin
TL;DR
The paper tackles the challenge of managing machine learning (ML) metadata in the physical sciences by developing a domain-specific tool for dynamic radiography. The tool integrates interactive visualization, visual queries for training-data selection, and centralized metadata tracking via a SQLite backend to support reproducibility. The study demonstrates improved data exploration, more efficient training-data selection, and insights into parameter sensitivity and degeneracy in density-field reconstruction. This approach lets scientists iteratively refine training datasets while preserving the provenance of their decisions, with potential extension to broader scientific workflows.
Abstract
Most machine learning models require many iterations of hyperparameter tuning, feature engineering, and debugging to produce effective results. As machine learning models become more complicated, this pipeline becomes more difficult to manage effectively. In the physical sciences, the scientific research cycle generates an ever-increasing pool of metadata. Tracking this metadata can reduce redundant work, improve reproducibility, and aid in feature engineering and training-dataset engineering. In this case study, we present a tool for machine learning metadata management in dynamic radiography. We evaluate the efficacy of this tool against the initial research workflow and discuss extensions to general machine learning pipelines in the physical sciences.
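As a rough illustration of centralized metadata tracking for training-data selection, the following minimal sketch uses Python's standard `sqlite3` module. It is not the tool's actual schema: the table names (`simulation`, `selection`, `selection_member`), the parameter columns (`param_a`, `param_b`), and the helper functions are all hypothetical, chosen only to show how a selection query and its rationale could be stored alongside the selected items so that the decision remains reproducible.

```python
import sqlite3


def make_store():
    """Create an in-memory metadata store (hypothetical schema, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE simulation (
            sim_id  INTEGER PRIMARY KEY,
            param_a REAL NOT NULL,   -- placeholder simulation parameter
            param_b REAL NOT NULL    -- placeholder simulation parameter
        );
        CREATE TABLE selection (
            selection_id INTEGER PRIMARY KEY,
            note  TEXT NOT NULL,     -- why this subset was chosen
            a_min REAL, a_max REAL,
            b_min REAL, b_max REAL
        );
        CREATE TABLE selection_member (
            selection_id INTEGER REFERENCES selection(selection_id),
            sim_id       INTEGER REFERENCES simulation(sim_id),
            PRIMARY KEY (selection_id, sim_id)
        );
        """
    )
    return conn


def select_training_data(conn, note, a_range, b_range):
    """Record a range query and its members; return (selection_id, sim_ids)."""
    cur = conn.execute(
        "INSERT INTO selection (note, a_min, a_max, b_min, b_max) "
        "VALUES (?, ?, ?, ?, ?)",
        (note, a_range[0], a_range[1], b_range[0], b_range[1]),
    )
    selection_id = cur.lastrowid
    # Store the concrete members so the selection survives later data changes.
    conn.execute(
        "INSERT INTO selection_member (selection_id, sim_id) "
        "SELECT ?, sim_id FROM simulation "
        "WHERE param_a BETWEEN ? AND ? AND param_b BETWEEN ? AND ?",
        (selection_id, a_range[0], a_range[1], b_range[0], b_range[1]),
    )
    conn.commit()
    rows = conn.execute(
        "SELECT sim_id FROM selection_member "
        "WHERE selection_id = ? ORDER BY sim_id",
        (selection_id,),
    ).fetchall()
    return selection_id, [row[0] for row in rows]


# Example usage with made-up parameter values.
store = make_store()
store.executemany(
    "INSERT INTO simulation (sim_id, param_a, param_b) VALUES (?, ?, ?)",
    [(1, 0.1, 1.0), (2, 0.5, 2.0), (3, 0.9, 3.0), (4, 0.5, 5.0)],
)
first_id, first_sims = select_training_data(
    store, "exclude extreme param_b", (0.4, 1.0), (1.5, 3.5)
)
```

Storing the query bounds, the free-text note, and the resulting member list together is one possible way to keep the provenance of each selection; a real tool would likely track many more fields, such as model runs and evaluation metrics.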
