Lightweight Multimodal Artificial Intelligence Framework for Maritime Multi-Scene Recognition
Xinyu Xi, Hua Yang, Shentai Zhang, Yijie Liu, Sijin Sun, Xiuju Fu
TL;DR
This work tackles maritime scene recognition under challenging environmental conditions by introducing a lightweight multimodal AI framework that fuses image features, MLLM-generated textual descriptions, and probabilistic classification vectors. The architecture encodes images with a Swin Transformer, text with BERT, and classification vectors with an MLP. A four-part fusion strategy (attention, weighted integration, alignment via mutual information and JS divergence, and dynamic modality prioritization) feeds a final classification layer. To enable real-time edge deployment on resource-constrained Autonomous Surface Vehicles (ASVs), the model is compressed with Activation-aware Weight Quantization (AWQ), reaching a 68.75 MB footprint with only a 0.5% drop in accuracy (98.0% at full precision, 97.5% with full AWQ) and substantial gains in throughput and memory efficiency. On a curated maritime dataset, the full-precision model outperforms pure-vision and other multimodal baselines and remains robust on challenging samples. The approach offers a practical pathway to reliable, real-time maritime monitoring and disaster response on edge devices, with dataset expansion and semi-supervised scaling left to future work.
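To make the JS divergence term in the alignment step concrete, the following is a minimal sketch of how the Jensen-Shannon divergence between two class-probability distributions can be computed. It is not the authors' implementation: the function names (`softmax`, `kl_divergence`, `js_divergence`) and the example logits are hypothetical, and the summary does not say which distributions the framework actually compares or how the term enters the loss.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln(2)."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical per-modality class scores for one sample (illustrative only).
image_probs = softmax([2.0, 0.5, -1.0])
text_probs = softmax([1.5, 1.0, -0.5])
alignment_penalty = js_divergence(image_probs, text_probs)
```

A smaller `alignment_penalty` means the two modalities agree more closely on the class distribution. Because the measure is symmetric and bounded, it is a common choice for this kind of consistency term, though the weighting used in the paper is not stated here.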
Abstract
Maritime Multi-Scene Recognition is crucial for enhancing the capabilities of intelligent marine robotics, particularly in applications such as marine conservation, environmental monitoring, and disaster response. However, the task is challenging for two reasons: environmental interference, where marine conditions degrade image quality, and the complexity of maritime scenes, which requires deeper reasoning for accurate recognition. Pure vision models are insufficient to address these issues. To overcome these limitations, we propose a novel multimodal Artificial Intelligence (AI) framework that integrates image data with textual descriptions and classification vectors generated by a Multimodal Large Language Model (MLLM), providing richer semantic understanding and improving recognition accuracy. Our framework employs an efficient multimodal fusion mechanism to further enhance model robustness and adaptability in complex maritime environments. Experimental results show that our model achieves 98% accuracy, surpassing previous state-of-the-art (SOTA) models by 3.5%. To optimize deployment on resource-constrained platforms, we adopt activation-aware weight quantization (AWQ) to compress the model, reducing its size to 68.75 MB with only a 0.5% accuracy drop while significantly lowering computational overhead. This work provides a high-performance solution for real-time maritime scene recognition, enabling Autonomous Surface Vehicles (ASVs) to support environmental monitoring and disaster response in resource-limited settings.
