Improvements & Evaluations on the MLCommons CloudMask Benchmark

Varshitha Chennamsetti; Laiba Mehnaz; Dan Zhao; Banani Ghosh; Sergey V. Samsonau

TL;DR

The paper benchmarks MLCommons' cloud-masking task on NYU's Greene HPC cluster using a U-Net baseline and updated tooling. It evaluates both pixel-level accuracy and computational performance across five seeded runs. The best model, saved at epoch 147, reaches a train accuracy of $0.909$ and a test accuracy of $0.896$; the average accuracy across runs is $0.889$. Key contributions are code improvements (accurate logging, early stopping, seed control, parallel-job automation) and reproducible benchmarking results, with the code available on GitHub to support MLCommons' benchmarks. The results give a reproducible reference point for accuracy and timing on this benchmark in an HPC environment, which can inform future benchmark development.

Abstract

In this paper, we report performance benchmarking results for deep learning models on MLCommons' Science cloud-masking benchmark, obtained on NYU Greene, a high-performance computing cluster at New York University (NYU). MLCommons is a consortium that develops and maintains several scientific benchmarks that can benefit from developments in AI. We describe the cloud-masking benchmark task, the updated code, and the best model for this benchmark under our selected hyperparameter settings. Our benchmarking results include the highest accuracy achieved on the NYU system as well as the average time taken for both training and inference across several runs/seeds. Our code can be found on GitHub. The MLCommons team has been kept informed about our progress and may use the developed code in their future work.

Paper Structure (10 sections, 4 figures, 3 tables)

Introduction
MLCommons
Related Work
Dataset
Model
Experiments
Resources & Hardware
Code Modifications
Results
Conclusion

Figures (4)

Figure 1: An illustration of how training and testing datasets are pre-processed before training/inference.
Figure 2: Training set results with different runs. With early stopping and patience of 25, the 5 different runs stop their training and save the model weights at epochs 200, 147, 162, 200, 183, respectively.
Figure 3: Validation set results across different runs. With early stopping and patience of 25, the 5 different runs stop their training and save the model weights at epochs 200, 147, 162, 200, 183, respectively.
Figure 4: Detailed loss and accuracy curves for our best trial/run (147 epochs) with an overall end train accuracy of 0.909 and test accuracy of 0.896.
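The captions above refer to early stopping with a patience of 25. As a minimal sketch of how such a rule typically works, the snippet below tracks the best validation loss seen so far and stops once 25 consecutive epochs pass without improvement. The class name `EarlyStopping`, the helper `run_training`, the `min_delta` parameter, and the choice of validation loss as the monitored metric are all assumptions made for illustration; the paper's own implementation may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class EarlyStopping:
    """Hypothetical patience-based stopping rule (not the paper's code)."""

    patience: int = 25          # patience value taken from the figure captions
    min_delta: float = 0.0      # assumed: minimum decrease that counts as improvement
    best_loss: float = float("inf")
    best_epoch: Optional[int] = None
    epochs_without_improvement: int = 0

    def update(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember this epoch (weights would be saved here).
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def run_training(val_losses: Sequence[float], patience: int = 25) -> Tuple[int, Optional[int]]:
    """Replay a sequence of per-epoch validation losses through the stopping rule.

    Returns (last_epoch_run, best_epoch), with epochs counted from 1.
    """
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.update(epoch, loss):
            return epoch, stopper.best_epoch
    # The epoch budget ran out before the patience limit was reached.
    return len(val_losses), stopper.best_epoch
```

Under this rule, a run whose validation loss keeps improving uses its full epoch budget, while a run that plateaus ends 25 epochs after its best epoch. Whether the epochs listed in the captions are the best epochs or the last epochs run is not stated there, so the sketch reports both.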
