MULTIAQUA: A multimodal maritime dataset and robust training strategies for multimodal semantic segmentation
Jon Muhovič, Janez Perš
TL;DR
The paper introduces MULTIAQUA, a publicly available multimodal maritime dataset with synchronized RGB, thermal, IR, LIDAR, radar, and GPS/IMU data and pixel-level annotations, intended to support robust semantic segmentation under challenging conditions. It proposes architecture and training refinements, most notably a double forward pass and modality-specific decoder heads, that enable models trained on daytime data to leverage auxiliary modalities (thermal, LIDAR) and maintain performance in near-darkness. Experiments show substantial nighttime improvements for CMNeXt, MMSFormer, and StitchFusion, and demonstrate that the approach generalizes to other multimodal datasets. The work highlights practical benefits for safe autonomous maritime navigation and points to future improvements in data degradation modeling and sensor-quality awareness.
Abstract
Unmanned surface vehicles can encounter a wide variety of visual conditions during operation, some of which can be very difficult to interpret. While most cases can be handled using only color camera images, some weather and lighting conditions require additional information. To expand the available maritime data, we present MULTIAQUA (Multimodal Aquatic Dataset), a novel multimodal maritime dataset. Our dataset contains synchronized, calibrated, and annotated data captured by sensors of different modalities, including RGB, thermal, IR, and LIDAR. The dataset is aimed at developing supervised methods that can extract useful information from these modalities in order to provide high-quality scene interpretation regardless of potentially poor visibility conditions. To illustrate the benefits of the proposed dataset, we evaluate several multimodal methods on our difficult nighttime test set. We present training approaches that make multimodal methods more robust, enabling them to retain reliable performance even in near-complete darkness. Our approach allows a robust deep neural network to be trained using only daytime images, thus significantly simplifying data acquisition, annotation, and the training process.
