Multiple Random Masking Autoencoder Ensembles for Robust Multimodal Semi-supervised Learning
Alexandru-Raul Todoran, Marius Leordeanu
TL;DR
The paper tackles learning across multiple data modalities with missing observations by introducing MR-MAE, a framework that uses multiple random masking of features to train a flexible, task-agnostic predictor that implicitly forms a large ensemble of input–output mappings. It adds an automatic feature-importance mechanism via a Loss Matrix and enables semi-supervised learning through ensemble-based pseudo-labels. The authors validate MR-MAE on NASA's Earth Observation NEO dataset with 19 layers, demonstrating robustness to missing data, competitive performance against a multi-task hyper-graph model, and clear advantages in feature interpretation and climate insight discovery. The work suggests practical climate science applications and points to future improvements by incorporating stronger backbones such as Transformer architectures.
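The ensemble idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `sample_masks`, `toy_predictor`, and `ensemble_predict` are hypothetical, and the toy predictor (an average of the visible features) merely stands in for a trained masked autoencoder. The sketch assumes that each random mask hides the target layer plus a random subset of input layers, and that predictions made under different masks are averaged into a pseudo-label.

```python
import numpy as np


def sample_masks(n_masks, n_features, target, keep_prob=0.5, seed=0):
    """Sample boolean visibility masks; the target feature is always hidden."""
    rng = np.random.default_rng(seed)
    masks = rng.random((n_masks, n_features)) < keep_prob
    masks[:, target] = False  # never reveal the layer being predicted
    candidates = np.array([j for j in range(n_features) if j != target])
    for m in masks:  # each row is a view, so edits update `masks` in place
        if not m.any():
            # Assumption: every ensemble member sees at least one input.
            m[rng.choice(candidates)] = True
    return masks


def toy_predictor(x_masked, mask):
    """Hypothetical stand-in for a trained model: mean of the visible features."""
    return x_masked.sum(axis=-1) / mask.sum()


def ensemble_predict(predict_fn, x, masks):
    """Run the predictor once per mask; return the ensemble mean and spread."""
    preds = np.stack([predict_fn(x * m, m) for m in masks])
    return preds.mean(axis=0), preds.std(axis=0)


# Toy data: 4 samples, 5 feature layers; layer 0 is the one to predict.
x = np.full((4, 5), 2.0)
masks = sample_masks(n_masks=8, n_features=5, target=0)
pseudo_label, spread = ensemble_predict(toy_predictor, x, masks)
```

In this sketch the ensemble mean plays the role of a pseudo-label for unlabeled samples, and the spread across masks gives a rough indication of how much the members disagree.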
Abstract
There is an increasing number of real-world problems in computer vision and machine learning that require taking into account multiple interpretation layers (modalities or views) of the world and learning how they relate to each other. For example, in the case of Earth Observations from satellite data, it is important to be able to predict one observation layer (e.g. vegetation index) from other layers (e.g. water vapor, snow cover, temperature, etc.), both to better understand how the Earth System functions and to reliably predict information for one layer when its data is missing (e.g. due to measurement failure or error).
