Improvements & Evaluations on the MLCommons CloudMask Benchmark
Varshitha Chennamsetti, Laiba Mehnaz, Dan Zhao, Banani Ghosh, Sergey V. Samsonau
TL;DR
The paper benchmarks MLCommons' cloud-masking task on the NYU Greene HPC cluster using a U-Net baseline and updated tooling. It evaluates both pixel-level accuracy and computational performance across five seeded runs. The best model, obtained at epoch 147, reaches a train accuracy of $0.909$ and a test accuracy of $0.896$; the average accuracy across runs is $0.889$. Key contributions are code improvements (accurate logging, early stopping, seed control, and automation of parallel jobs) and reproducible benchmarking results, with the code made available on GitHub to support MLCommons' benchmarks. The results provide a reproducible reference point for assessing cloud-masking benchmarks in HPC environments and can inform future benchmark development.
Abstract
In this paper, we report performance benchmarking results for deep learning models on MLCommons' Science cloud-masking benchmark, obtained on NYU Greene, a high-performance computing cluster at New York University (NYU). MLCommons is a consortium that develops and maintains several scientific benchmarks that can benefit from developments in AI. We provide a description of the cloud-masking benchmark task, updated code, and the best model for this benchmark under our selected hyperparameter settings. Our benchmarking results include the highest accuracy achieved on the NYU system as well as the average time taken for both training and inference on the benchmark across several runs with different seeds. Our code can be found on GitHub. The MLCommons team has been kept informed of our progress and may use the developed code in their future work.
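To make the reported metric concrete, the following minimal Python sketch shows one common way to compute a pixel-level accuracy for a binary cloud mask and to average accuracies over several seeded runs. It is an illustration only, not the benchmark's own code: the function names `pixel_accuracy` and `mean_accuracy_over_runs`, the nested-list mask representation, and the toy values are all assumptions made for this example.

```python
from statistics import mean


def pixel_accuracy(pred_mask, true_mask):
    """Return the fraction of pixels whose predicted label equals the reference label.

    Both masks are assumed (for this sketch) to be lists of rows of 0/1 labels
    with identical shapes.
    """
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of rows")
    correct = 0
    total = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        if len(pred_row) != len(true_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, true_row):
            correct += int(p == t)
            total += 1
    if total == 0:
        raise ValueError("masks must contain at least one pixel")
    return correct / total


def mean_accuracy_over_runs(per_run_accuracy):
    """Average a list of per-run accuracies (one value per seeded run)."""
    return mean(per_run_accuracy)


# Toy 2x3 masks (illustrative values, not benchmark data): 5 of 6 pixels agree.
pred = [[1, 0, 1], [0, 0, 1]]
true = [[1, 0, 0], [0, 0, 1]]
toy_accuracy = pixel_accuracy(pred, true)
```

In practice the masks would be model outputs thresholded to binary labels, and the list passed to `mean_accuracy_over_runs` would hold one test accuracy per seed.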
