MCTED: A Machine-Learning-Ready Dataset for Digital Elevation Model Generation From Mars Imagery

Rafał Osadnik, Pablo Gómez, Eleni Bohacek, Rickbir Bahia

TL;DR

This paper introduces MCTED, a machine-learning-ready dataset of Mars CTX-derived orthoimage–DEM patches designed for single-image DEM generation. It systematically cleans and curates 80,898 samples from the repository of Day et al., addressing data quality issues and producing image, DEM, and mask patches suitable for supervised learning, with cluster-based train/validation splits to avoid data leakage. A simple U-Net baseline trained on MCTED outperforms a large monocular depth estimation foundation model (DepthAnythingV2) on DEM prediction, highlighting a domain gap and the need for domain-specific training data. The dataset and accompanying code are openly available, enabling further development of efficient, high-resolution Martian DEM generation from single images. The work provides a practical path toward higher-resolution global Mars DEMs using monocular cues, with clear limitations and directions for future improvement.

Abstract

This work presents MCTED, a new machine-learning-ready dataset for the Martian digital elevation model (DEM) prediction task. The dataset was generated using a comprehensive pipeline designed to process high-resolution Mars orthoimage and DEM pairs from Day et al., yielding 80,898 data samples. The source images were gathered by the Mars Reconnaissance Orbiter using the CTX instrument and provide diverse and comprehensive coverage of the Martian surface. Given the complexity of the processing pipelines used to produce large-scale DEMs, the original data often contain artefacts and missing data points, and we developed tools to remove them or mitigate their impact. We divide the processed samples into training and validation splits, ensuring that the two splits share no common areas to avoid data leakage. Every sample in the dataset consists of an optical image patch, a DEM patch, and two mask patches that indicate values that were originally missing or were altered by us. This allows future users of the dataset to handle altered elevation regions as they please. We provide statistical insights into the generated dataset, including the spatial distribution of samples and the distributions of elevation values, slopes and more. Finally, we train a small U-Net architecture on MCTED and compare its performance to a monocular depth estimation foundation model, DepthAnythingV2, on the task of elevation prediction. We find that even a very small architecture trained specifically on this dataset beats the zero-shot performance of a depth estimation foundation model like DepthAnythingV2. We make the dataset and the code used for its generation completely open source in public repositories.
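As an illustration of how the two mask patches might be used downstream, the following minimal sketch computes an elevation error only over pixels that were neither originally missing nor altered. It is not the authors' evaluation code: the function name `masked_mae`, the array names, and the convention that a nonzero mask value flags a pixel are all assumptions made for this example, and the dataset's actual mask encoding should be checked before reuse.

```python
import numpy as np


def masked_mae(pred, target, missing_mask, altered_mask):
    """Mean absolute error over pixels flagged by neither mask.

    Assumption (hypothetical convention): nonzero mask values mark pixels
    that were originally missing or were altered during processing.
    """
    flagged = missing_mask.astype(bool) | altered_mask.astype(bool)
    valid = ~flagged
    if not valid.any():
        # No trustworthy pixels in this patch, so the error is undefined.
        return float("nan")
    return float(np.abs(pred[valid] - target[valid]).mean())


# Toy 2x2 patch: one pixel flagged as missing, one as altered.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.array([[1.5, 2.0], [0.0, 5.0]])
missing = np.array([[0, 0], [1, 0]])
altered = np.array([[0, 1], [0, 0]])

print(masked_mae(pred, target, missing, altered))  # 0.75
```

Excluding flagged pixels is only one option; a user could equally down-weight them or exclude only the originally missing ones, which is the flexibility that separate masks allow.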

Paper Structure

This paper contains 39 sections, 5 equations, 19 figures, 3 tables.

Figures (19)

  • Figure 1: Every part of a single day2023mars sample available in the repository. day2023mars sample name: b02_010423_1720_xi_08s084w_b18_016726_1719_xn_08s084w
  • Figure 2: Localisation of all CTX-derived orthoimages and DEMs in the repository on the surface of Mars.
  • Figure 3: The same terrain fragment as it appears in all three data types in their original form. Discrepancies in resolution and orientation are easy to see, so the data types must be co-registered and rescaled to a common resolution before they can be superimposed. Because each data type has a different native resolution, features cannot be matched exactly across them: the lower-resolution data irreversibly lose detail, which makes precise superimposition impossible. day2023mars sample name: b07_012511_1819_xn_01n211w_b07_012234_1820_xn_02n211w
  • Figure 4: Elevation artefacts on elevation maps in the repository
  • Figure 5: Histograms of the height and width ratios between the optical images and DEMs in the day2023mars repository. In most cases, the optical images are roughly three times the size of the DEMs in each dimension.
  • ...and 14 more figures