MULTIAQUA: A multimodal maritime dataset and robust training strategies for multimodal semantic segmentation

Jon Muhovič, Janez Perš

TL;DR

The paper introduces MULTIAQUA, a publicly available multimodal maritime dataset with synchronized RGB, thermal, IR, LIDAR, radar, and GPS/IMU data and pixel-level annotations, intended to support robust semantic segmentation under challenging conditions. It proposes architectural and training refinements, most notably a double forward pass and modality-specific decoder heads, that enable daytime-trained models to leverage auxiliary modalities (thermal, LIDAR) and maintain performance in near-darkness. Experiments show substantial nighttime improvements for CMNeXt, MMSFormer, and StitchFusion and demonstrate generalization to other multimodal datasets. The work highlights practical benefits for safe autonomous maritime navigation and points to future improvements in modeling data degradation and in sensor-quality awareness.
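The TL;DR names a double forward pass and modality-specific decoder heads but does not say how the two passes differ. As a minimal sketch, assuming the second pass blanks the RGB input to mimic near-darkness and that the model returns one prediction per head (a fused head plus per-modality heads), the combined training loss could be assembled as below. The names `blank_rgb` and `double_forward_loss` and the dictionary keys are hypothetical and illustrative, not the authors' code.

```python
import numpy as np


def blank_rgb(batch):
    """Return a shallow copy of the batch with the RGB modality zeroed out.

    ASSUMPTION: an all-zero RGB image is used here as a stand-in for
    near-darkness; the original array is not modified.
    """
    degraded = dict(batch)
    degraded["rgb"] = np.zeros_like(batch["rgb"])
    return degraded


def double_forward_loss(model, batch, labels, loss_fn):
    """Sum the losses of every decoder head over two forward passes.

    Pass 1 sees all modalities intact; pass 2 (hypothetical) sees the same
    sample with RGB blanked, so the heads are also supervised when only
    auxiliary modalities such as thermal or LIDAR carry information.
    `model` is assumed to return a dict {head_name: prediction}.
    """
    total = 0.0
    for inputs in (batch, blank_rgb(batch)):
        outputs = model(inputs)
        for _head_name, prediction in outputs.items():
            total += loss_fn(prediction, labels)
    return total
```

Summing both passes means a daytime-only training set still produces examples in which RGB is uninformative, which is one plausible way such a scheme could push the network toward relying on thermal and LIDAR at night.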

Abstract

Unmanned surface vehicles can encounter a wide variety of visual conditions during operation, some of which are very difficult to interpret. While most cases can be handled using only color camera images, some weather and lighting conditions require additional information. To expand the available maritime data, we present MULTIAQUA (Multimodal Aquatic Dataset), a novel multimodal maritime dataset. Our dataset contains synchronized, calibrated, and annotated data captured by sensors of different modalities, such as RGB, thermal, IR, and LIDAR. The dataset is aimed at developing supervised methods that extract useful information from these modalities in order to provide high-quality scene interpretation even under poor visibility. To illustrate the benefits of the proposed dataset, we evaluate several multimodal methods on our difficult nighttime test set. We present training approaches that make multimodal methods more robust, enabling them to retain reliable performance even in near-complete darkness. Our approach allows a robust deep neural network to be trained using only daytime images, which significantly simplifies data acquisition, annotation, and training.

Paper Structure

This paper contains 26 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Examples of data from our dataset and the corresponding semantic labels. The upper-right image shows LIDAR points overlaid on the RGB image, while the bottom row shows the thermal camera image and the semantic annotation overlay, respectively. The semantic labels sky, water, and static obstacle are shown in purple, blue, and green, respectively.
  • Figure 2: Examples of the different sensor modalities included in our sensor system. Note the differing resolutions, focal lengths, and aspect ratios of the images. The LIDAR points in the upper-left image are colored by distance. The radar points in the bottom-left image are shown in purple.
  • Figure 3: Examples of images from our dataset, captured at different locations and under different circumstances. Note the high contrast, sun glare, cluttered environments and low-light conditions.
  • Figure 4: Radar chart of the experimental results on the MULTIAQUA dataset. CMNeXt, MMSFormer, and StitchFusion variants are shown with solid, dashed, and dotted lines, respectively. The scores shown are mIoU on the validation and test sets, as well as per-class IoU on the test set. The ordering of method variants matches that of the main results table. The value range of each axis is scaled to the corresponding data for clarity.
  • Figure 5: Example images from the nighttime test set. The first column shows raw RGB images, the second column shows thermal images overlaid on the RGB images, and the third column shows ground-truth annotations. The last three columns show the predictions of the best-performing nighttime models: CMNeXt-DH, MMSFormer-D, and StitchFusion-D. Semantic labels for the sky, static obstacle, water, and dynamic obstacle classes are shown in cyan, green, blue, and red, respectively.
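Figure 4 reports mIoU and per-class IoU. As a minimal sketch using the standard definitions (per-class IoU = TP / (TP + FP + FN), and mIoU = the mean of the per-class IoUs over classes that occur in either the prediction or the ground truth), both scores can be computed from a confusion matrix. The integer class encoding and the `ignore_index=255` convention for unlabeled pixels are assumptions, not taken from the paper.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Rows index ground-truth classes, columns index predicted classes.

    ASSUMPTION: labels are integers in [0, num_classes), and pixels whose
    target equals `ignore_index` are excluded from evaluation.
    """
    mask = target != ignore_index
    idx = num_classes * target[mask].astype(np.int64) + pred[mask].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


def iou_per_class(conf):
    """IoU = TP / (TP + FP + FN); NaN for classes absent from both maps."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp  # predicted as class c, but labeled otherwise
    fn = conf.sum(axis=1) - tp  # labeled as class c, but predicted otherwise
    denom = tp + fp + fn
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, tp / denom, np.nan)


def mean_iou(conf):
    """Average IoU over classes that have a defined (non-NaN) score."""
    return float(np.nanmean(iou_per_class(conf)))
```

Accumulating one confusion matrix over the whole test set and computing IoU once at the end, rather than averaging per-image scores, keeps rare classes such as dynamic obstacles from being dominated by images where they are absent.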