Real-time Ship Recognition and Georeferencing for the Improvement of Maritime Situational Awareness

Borja Carrillo Perez

TL;DR

This work addresses real-time ship recognition and georeferencing to enhance maritime situational awareness. It introduces ShipSG, a real-world dataset with ship masks and geographic positions, and develops ScatYOLOv8+CBAM, a real-time segmentation architecture optimized for embedded hardware that combines a 2D scattering transform with attention mechanisms. The approach achieves an mAP of about $75.46\%$ with frame times near $25.3$ ms on the NVIDIA Jetson AGX Xavier, and introduces a slicing strategy that improves mAP for small and distant ships by about $8$–$11\%$. A monocular georeferencing method based on image homographies yields positioning errors of approximately $18$ m for ships up to $400$ m away and $44$ m for ships between $400$ m and $1200$ m away, enabling real-time visualization on maps and integration with other maritime data streams. Overall, the work demonstrates the viability of deep-learning-based ship recognition and georeferencing on embedded hardware, establishing ShipSG as a benchmark and offering a practical, scalable framework for maritime monitoring and decision support.
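As a rough illustration of homography-based georeferencing, the sketch below fits a 3x3 homography from pixel coordinates to planar world coordinates and maps a new pixel through it. It is a minimal sketch under stated assumptions, not the thesis's method: the normalized direct linear transform (DLT), the function names (`apply_homography`, `fit_homography`, `position_error`) and the use of a local metric east/north plane as the target coordinate system are all hypothetical choices made for this example.

```python
import numpy as np


def apply_homography(h, points):
    """Map Nx2 points through the 3x3 homography h (with perspective divide)."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ h.T
    return mapped[:, :2] / mapped[:, 2:3]


def _normalizer(pts):
    """Similarity transform: centre the points, mean distance to origin sqrt(2)."""
    centroid = pts.mean(axis=0)
    scale = np.sqrt(2) / np.linalg.norm(pts - centroid, axis=1).mean()
    return np.array([[scale, 0.0, -scale * centroid[0]],
                     [0.0, scale, -scale * centroid[1]],
                     [0.0, 0.0, 1.0]])


def fit_homography(pixels, world):
    """Estimate H with world ~ H @ pixel from >= 4 correspondences (normalized DLT)."""
    pixels = np.asarray(pixels, dtype=float)
    world = np.asarray(world, dtype=float)
    t_px, t_w = _normalizer(pixels), _normalizer(world)
    px_n, w_n = apply_homography(t_px, pixels), apply_homography(t_w, world)
    rows = []
    for (x, y), (u, v) in zip(px_n, w_n):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    h_n = vt[-1].reshape(3, 3)
    # Undo the normalization so H acts on raw pixel and world coordinates.
    h = np.linalg.inv(t_w) @ h_n @ t_px
    return h / h[2, 2]


def position_error(estimated, reference):
    """Euclidean distance per point, in the units of the world coordinates."""
    return np.linalg.norm(np.asarray(estimated) - np.asarray(reference), axis=1)
```

In use, the pixel would be a point taken from the ship mask and the reference position would come from an independent source; a homography of this kind only holds for points on a single plane, here assumed to be the water surface.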

Abstract

In an era where maritime infrastructures are crucial, advanced situational awareness solutions are increasingly important. Optical camera systems make it possible to use maritime footage in real time. This thesis investigates how deep learning and computer vision can advance real-time ship recognition and georeferencing to improve maritime situational awareness. A novel dataset, ShipSG, is introduced, containing 3,505 images and 11,625 ship masks with their corresponding classes and geographic positions. After a review of the state of the art, a custom real-time segmentation architecture, ScatYOLOv8+CBAM, is designed for the NVIDIA Jetson AGX Xavier embedded system. This architecture adds the 2D scattering transform and attention mechanisms to YOLOv8, achieving an mAP of 75.46% at 25.3 ms per frame and outperforming state-of-the-art methods by over 5%. To improve the recognition of small and distant ships in high-resolution images on embedded systems, an enhanced slicing mechanism is introduced, improving mAP by 8% to 11%. Additionally, a georeferencing method is proposed, achieving positioning errors of 18 m for ships up to 400 m away and 44 m for ships between 400 m and 1200 m away. The findings are also applied in real-world scenarios, such as the detection of abnormal ship behaviour, camera integrity assessment and 3D reconstruction. The approach of this thesis outperforms existing methods and provides a framework for integrating recognized and georeferenced ships into real-time systems, enhancing operational effectiveness and decision-making for maritime stakeholders. This thesis contributes to the maritime computer vision field by establishing a benchmark for ship segmentation and georeferencing research and by demonstrating the viability of deep-learning-based recognition and georeferencing methods for real-time maritime monitoring.
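The general idea behind slicing a high-resolution image can be sketched as follows: the image is cut into overlapping tiles, each tile is passed to the recognition model at closer to native resolution, and the resulting boxes are shifted back into full-image coordinates. This is a generic tiling sketch and not the enhanced slicing mechanism of the thesis; the function names, the tile size and the pixel overlap are assumptions chosen for illustration.

```python
def slice_windows(width, height, tile, overlap_px):
    """Return (x0, y0, x1, y1) tiles that cover the image with the given overlap."""
    step = max(1, tile - overlap_px)

    def starts(size):
        # A single tile suffices when the image is not larger than the tile.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        positions.append(size - tile)  # last tile is flush with the image border
        return positions

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height) for x in starts(width)]


def to_full_image(box, window):
    """Shift a tile-relative box (x0, y0, x1, y1) into full-image coordinates."""
    dx, dy = window[0], window[1]
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


# Example with assumed values: a 1920x1080 frame, 640-pixel tiles, 128-pixel overlap.
windows = slice_windows(1920, 1080, tile=640, overlap_px=128)
```

Detections that fall in the overlap between tiles appear more than once after shifting, so a merging step such as non-maximum suppression over the full image is typically applied afterwards.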

Paper Structure (39 sections, 15 equations, 33 figures, 8 tables)

Introduction
Fundamentals of Modern Object Recognition
Supervised Learning in Computer Vision
Deep-Learning-Based Object Recognition
Standard Architecture Description
Attention Mechanisms
Object Classification and Postprocessing
Training Process
Evaluation Metrics
Relevant State of the Art
Real-world Maritime Datasets
Ship Recognition Using Maritime Monitoring Footage
Georeferencing of Recognized Ships
Deployment on Embedded Systems
ShipSG: Ship Segmentation and Georeferencing Dataset
...and 24 more sections

Figures (33)

Figure 1: Conceptual representation of ship detection, segmentation and georeferencing from maritime footage. (a) Tanker being detected. Point 1 represents the bounding box center, and point 2 the bottom-center point. (b) Tanker being segmented. Point 3 represents the intersection of the navigation antenna with the water. (c) Representation of the georeferenced tanker displayed on a map. The georeference from the mask, point 3, provides the most accurate ship location of the three points.
Figure 2: Example of object detection and instance segmentation on an image. Object detection involves bounding box localization and classification, whereas instance segmentation goes beyond that to provide a mask outlining the exact shape of each individual object instance. Adapted from shanmugamani2018deep.
Figure 3: Standard deep learning object recognition architecture. See text for details.
Figure 4: Illustration of a standard convolution operation, taken from zhang2020lightweight. The input volume has dimensions $H \times W \times M$ (height, width, and number of channels). A filter, also named kernel, of size $K \times K \times M$ is convolved with the input, producing an output volume of dimensions $H \times W \times N$, where $N$ is the number of filters. This process involves sliding the filter over the input and computing the dot products between the filter weights and local regions of the input.
Figure 5: Illustration of max pooling and average pooling operations with a $2\times2$ pool size, taken from zhang2020lightweight. Max pooling selects the maximum value from each $2\times2$ block, while average pooling computes the average value from each $2\times2$ block, reducing the spatial dimensions of the input feature maps.
...and 28 more figures
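
The operations described in the captions of Figures 4 and 5 can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names are hypothetical, the convolution assumes stride 1, odd $K$ and zero padding so that the output keeps the $H \times W$ size stated in the caption, and, as is usual in deep learning, it computes a cross-correlation (the filter is not flipped).

```python
import numpy as np


def conv2d_same(x, filters):
    """x: (H, W, M) input; filters: (N, K, K, M) with odd K. Returns (H, W, N)."""
    n, k, _, _ = filters.shape
    pad = k // 2
    # Zero-pad height and width so the output keeps the input's spatial size.
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h, w, _ = x.shape
    out = np.zeros((h, w, n))
    for i in range(h):
        for j in range(w):
            patch = xp[i:i + k, j:j + k, :]
            # Dot product of each of the N filters with the local K x K x M region.
            out[i, j] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out


def pool2x2(x, mode="max"):
    """2x2 pooling with stride 2 over the first two axes; odd edges are cropped."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    blocks = x[:h, :w].reshape(h // 2, 2, w // 2, 2, *x.shape[2:])
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))


feature_map = np.array([[1, 2, 5, 6],
                        [3, 4, 7, 8],
                        [9, 10, 13, 14],
                        [11, 12, 15, 16]], dtype=float)
pooled_max = pool2x2(feature_map, "max")
pooled_avg = pool2x2(feature_map, "avg")
```

The explicit loops keep the correspondence with the caption visible; practical frameworks implement the same operation with vectorized or hardware-accelerated kernels.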
