Real-time Ship Recognition and Georeferencing for the Improvement of Maritime Situational Awareness

Borja Carrillo Perez

TL;DR

This work addresses real-time ship recognition and georeferencing to enhance maritime situational awareness. It introduces ShipSG, a real-world dataset with ship masks and geographic positions, and develops ScatYOLOv8+CBAM, an embedded-optimized, real-time segmentation architecture that fuses a 2D scattering transform with attention mechanisms. The approach achieves an mAP of about $75.46\%$ at roughly $25.3$ ms per frame on the NVIDIA Jetson AGX Xavier, and introduces a slicing strategy that improves small-ship detection by about $8$–$11\%$. A monocular georeferencing method based on image homographies yields positioning errors of approximately $18\,\mathrm{m}$ for ships within $400\,\mathrm{m}$ and $44\,\mathrm{m}$ for ships between $400\,\mathrm{m}$ and $1200\,\mathrm{m}$, enabling real-time visualization on maps and integration with other maritime data streams. Overall, the work demonstrates the viability of deep-learning-based ship recognition and georeferencing on embedded hardware, establishing ShipSG as a benchmark and offering a practical, scalable framework for maritime monitoring and decision support.

Abstract

In an era where maritime infrastructures are crucial, advanced situational awareness solutions are increasingly important. Optical camera systems allow maritime footage to be used in real time. This thesis presents an investigation into leveraging deep learning and computer vision to advance real-time ship recognition and georeferencing for the improvement of maritime situational awareness. A novel dataset, ShipSG, is introduced, containing 3,505 images and 11,625 ship masks with corresponding class and geographic position. After an exploration of the state of the art, a custom real-time segmentation architecture, ScatYOLOv8+CBAM, is designed for the NVIDIA Jetson AGX Xavier embedded system. This architecture adds the 2D scattering transform and attention mechanisms to YOLOv8, achieving an mAP of 75.46% and a frame time of 25.3 ms, outperforming state-of-the-art methods by over 5%. To improve small and distant ship recognition in high-resolution images on embedded systems, an enhanced slicing mechanism is introduced, improving mAP by 8% to 11%. Additionally, a georeferencing method is proposed, achieving positioning errors of 18 m for ships up to 400 m away and 44 m for ships between 400 m and 1200 m. The findings are also applied in real-world scenarios, such as the detection of abnormal ship behaviour, camera integrity assessment and 3D reconstruction. The approach of this thesis outperforms existing methods and provides a framework for integrating recognized and georeferenced ships into real-time systems, enhancing operational effectiveness and decision-making for maritime stakeholders. This thesis contributes to the maritime computer vision field by establishing a benchmark for ship segmentation and georeferencing research, demonstrating the viability of deep-learning-based recognition and georeferencing methods for real-time maritime monitoring.
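
To make the idea of homography-based georeferencing more concrete, the following is a minimal sketch and not the thesis's implementation. It assumes that ships lie on a planar water surface, that ground positions are expressed in a local planar coordinate system (for example metres east and north of the camera), and that exactly four pixel/ground correspondences are available. The function names `solve_linear`, `estimate_homography` and `pixel_to_ground` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix = List[List[float]]


def solve_linear(a: Matrix, b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; rows are copied so the inputs stay untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def estimate_homography(pixels: Sequence[Point], ground: Sequence[Point]) -> Matrix:
    """Estimate a 3x3 homography (h33 fixed to 1) from four correspondences."""
    if len(pixels) != 4 or len(ground) != 4:
        raise ValueError("exactly four correspondences are required")
    a: Matrix = []
    b: List[float] = []
    for (u, v), (x, y) in zip(pixels, ground):
        # From x = (h11*u + h12*v + h13) / (h31*u + h32*v + 1), and likewise for y.
        a.append([u, v, 1.0, 0.0, 0.0, 0.0, -u * x, -v * x])
        b.append(x)
        a.append([0.0, 0.0, 0.0, u, v, 1.0, -u * y, -v * y])
        b.append(y)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def pixel_to_ground(H: Matrix, u: float, v: float) -> Point:
    """Map an image pixel (u, v) to planar ground coordinates using H."""
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    x = (H[0][0] * u + H[0][1] * v + H[0][2]) / w
    y = (H[1][0] * u + H[1][1] * v + H[1][2]) / w
    return (x, y)
```

In such a setup, a reference pixel taken from the ship mask would be passed to `pixel_to_ground` to obtain the ship's position. Converting the planar result to latitude and longitude is a separate step that is not shown here.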

Paper Structure

This paper contains 39 sections, 15 equations, 33 figures, 8 tables.

Figures (33)

  • Figure 1: Conceptual representation of ship detection, segmentation and georeferencing from maritime footage. (a) Tanker being detected. Point 1 represents the bounding box center, and point 2 the bottom-center point. (b) Tanker being segmented. Point 3 represents the intersection of the navigation antenna with the water. (c) Representation of the georeferenced tanker displayed on a map. The georeference from the mask, point 3, provides the most accurate ship location of the three points.
  • Figure 2: Example of object detection and instance segmentation on an image. Object detection involves bounding box localization and classification, whereas instance segmentation goes beyond that to provide a mask outlining the exact shape of each individual object instance. Adapted from shanmugamani2018deep.
  • Figure 3: Standard deep learning object recognition architecture. See text for details.
  • Figure 4: Illustration of a standard convolution operation, taken from zhang2020lightweight. The input volume has dimensions $H \times W \times M$ (height, width, and number of channels). A filter, also named kernel, of size $K \times K \times M$ is convolved with the input, producing an output volume of dimensions $H \times W \times N$, where $N$ is the number of filters. This process involves sliding the filter over the input and computing the dot products between the filter weights and local regions of the input.
  • Figure 5: Illustration of max pooling and average pooling operations with a $2\times2$ pool size, taken from zhang2020lightweight. Max pooling selects the maximum value from each $2\times2$ block, while average pooling computes the average value from each $2\times2$ block, reducing the spatial dimensions of the input feature maps.
  • ...and 28 more figures
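
The operations described in the captions of Figures 4 and 5 can be sketched in plain Python as follows. This is an illustrative sketch only, not code from the thesis. It assumes stride 1 and zero padding so that the convolution output keeps the $H \times W$ spatial size stated in the caption, an odd kernel size $K$, and even feature-map sides for the $2\times2$ pooling. The function names `conv2d_same` and `pool2x2` are hypothetical.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [row][column][channel]


def conv2d_same(x: Volume, filters: List[Volume]) -> Volume:
    """Convolve an H x W x M input with N filters of size K x K x M.

    Uses stride 1 and zero padding, so the output is H x W x N.
    As is usual in deep learning, the filter is not flipped.
    """
    h, w, m = len(x), len(x[0]), len(x[0][0])
    k = len(filters[0])
    pad = k // 2
    out = [[[0.0] * len(filters) for _ in range(w)] for _ in range(h)]
    for n, f in enumerate(filters):
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        # Positions outside the input count as zeros.
                        if 0 <= ii < h and 0 <= jj < w:
                            for c in range(m):
                                acc += f[di][dj][c] * x[ii][jj][c]
                out[i][j][n] = acc
    return out


def pool2x2(fmap: List[List[float]], mode: str = "max") -> List[List[float]]:
    """Apply non-overlapping 2x2 max or average pooling to a 2D feature map."""
    h, w = len(fmap), len(fmap[0])
    if h % 2 or w % 2:
        raise ValueError("feature map sides must be even")
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    out = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            block = [fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1]]
            row.append(max(block) if mode == "max" else sum(block) / 4)
        out.append(row)
    return out
```

For a $4\times4$ feature map, `pool2x2` returns a $2\times2$ map, halving each spatial dimension as the Figure 5 caption describes.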