Lightweight Multimodal Artificial Intelligence Framework for Maritime Multi-Scene Recognition

Xinyu Xi, Hua Yang, Shentai Zhang, Yijie Liu, Sijin Sun, Xiuju Fu

TL;DR

This work tackles maritime scene recognition under challenging environmental conditions by introducing a lightweight multimodal AI framework that fuses image features, MLLM-generated textual descriptions, and probabilistic classification vectors. The architecture encodes images with a Swin Transformer, text with BERT, and classification vectors with an MLP. A four-part fusion strategy then combines the three modalities through attention, weighted integration, alignment via mutual information and Jensen-Shannon (JS) divergence, and dynamic modality prioritization, followed by a final classification layer. To enable real-time edge deployment on resource-constrained ASVs, the model applies Activation-aware Weight Quantization (AWQ), which shrinks the footprint to 68.75 MB at a cost of only 0.5 percentage points of accuracy (97.5% under full AWQ versus 98.0% at full precision) while substantially improving throughput and memory efficiency. On a curated maritime dataset, the full-precision model outperforms pure-vision and other multimodal baselines and remains robust on challenging samples. The approach offers a practical pathway to reliable, real-time maritime monitoring and disaster response on edge devices, with dataset expansion and semi-supervised scaling left to future work.
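To make the JS-divergence alignment term concrete, the following is a minimal sketch of how the divergence between two modalities' class distributions could be computed. It is illustrative only: the function names, the example logits, and the idea of using the value as an `alignment_penalty` are assumptions made here, not details taken from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-modality logits over the five scene categories named in
# the Figure 1 caption (debris, stranding, fire, capsize, red tide).
image_probs = softmax([2.0, 0.5, 0.1, 0.1, 0.3])
text_probs = softmax([1.8, 0.7, 0.2, 0.1, 0.2])

# A small value means the two modalities agree on the class distribution.
alignment_penalty = js_divergence(image_probs, text_probs)
print(f"JS divergence between image and text predictions: {alignment_penalty:.6f}")
```

Unlike plain KL divergence, the JS divergence is symmetric and stays finite even when one distribution assigns zero probability to a class, which is one plausible reason to prefer it when neither modality is treated as the reference.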

Abstract

Maritime Multi-Scene Recognition is crucial for enhancing the capabilities of intelligent marine robotics, particularly in applications such as marine conservation, environmental monitoring, and disaster response. However, this task presents significant challenges due to environmental interference, where marine conditions degrade image quality, and the complexity of maritime scenes, which requires deeper reasoning for accurate recognition. Pure vision models alone are insufficient to address these issues. To overcome these limitations, we propose a novel multimodal Artificial Intelligence (AI) framework that integrates image data, textual descriptions and classification vectors generated by a Multimodal Large Language Model (MLLM), to provide richer semantic understanding and improve recognition accuracy. Our framework employs an efficient multimodal fusion mechanism to further enhance model robustness and adaptability in complex maritime environments. Experimental results show that our model achieves 98% accuracy, surpassing previous SOTA models by 3.5%. To optimize deployment on resource-constrained platforms, we adopt activation-aware weight quantization (AWQ) as a lightweight technique, reducing the model size to 68.75 MB with only a 0.5% accuracy drop while significantly lowering computational overhead. This work provides a high-performance solution for real-time maritime scene recognition, enabling Autonomous Surface Vehicles (ASVs) to support environmental monitoring and disaster response in resource-limited settings.
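The general idea behind activation-aware weight quantization can be sketched as follows: input channels that see large activations are scaled up before rounding, so they lose less precision, and the scale is divided back out afterwards. This is a simplified illustration rather than the authors' pipeline. The 4-bit width, the per-row symmetric rounding, the fixed `alpha=0.5`, and all function names are assumptions made for this sketch.

```python
def quantize_row(row, n_bits=4):
    """Symmetric round-to-nearest quantization of one row, returned dequantized."""
    qmax = 2 ** (n_bits - 1) - 1          # 7 for 4-bit
    max_abs = max(abs(w) for w in row)
    if max_abs == 0:
        return list(row)
    step = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / step))) * step for w in row]

def awq_style_quantize(weights, act_scale, alpha=0.5, n_bits=4):
    """Quantize weights[out][in] with activation-aware per-input-channel scaling.

    act_scale[j] is an assumed calibration statistic: the mean absolute
    activation on input channel j, which must be positive.
    """
    s = [a ** alpha for a in act_scale]   # alpha = 0 disables the scaling
    scaled = [[w * sj for w, sj in zip(row, s)] for row in weights]
    quantized = [quantize_row(row, n_bits) for row in scaled]
    # Dividing by s here is equivalent to scaling the layer input by 1/s.
    return [[q / sj for q, sj in zip(row, s)] for row in quantized]

# Toy layer: channel 0 has a small weight but large activations (salient).
toy_weights = [[0.1, 1.0]]
toy_act_scale = [16.0, 1.0]

plain = awq_style_quantize(toy_weights, toy_act_scale, alpha=0.0)
aware = awq_style_quantize(toy_weights, toy_act_scale, alpha=0.5)

print("error on salient weight, plain rounding:", abs(plain[0][0] - 0.1))
print("error on salient weight, activation-aware:", abs(aware[0][0] - 0.1))
```

In this toy case the salient weight is reconstructed more accurately with scaling than without, which matters because its error is multiplied by large activations at inference time. In practice the exponent is typically searched per layer against calibration data rather than fixed.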

Paper Structure

This paper contains 51 sections, 22 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Examples of marine scene categories in the dataset: marine debris, animal stranding, ship fire, ship capsize, and red tide. All images simulate low-altitude, near-water perspectives, providing realistic scenarios for training multimodal recognition systems.
  • Figure 2: Framework diagram of the multimodal marine scene recognition system.
  • Figure 3: The post-training quantization process using AWQ for efficient deployment.
  • Figure 4: Challenging maritime scene samples showing correct classifications by our model. The true labels are displayed, with our model's accurate predictions in green and the incorrect predictions from other models in red.