A Machine Learning Framework for Handling Unreliable Absence Label and Class Imbalance for Marine Stinger Beaching Prediction

Amuche Ibenegbu, Amandine Schaeffer, Pierre Lafaye de Micheaux, Rohitash Chandra

TL;DR

This study tackles daily bluebottle beaching prediction in Eastern Sydney under unreliable negative labels and severe class imbalance. It introduces an ML framework that compares MLP, Random Forest, and XGBoost models under data augmentation (SMOTE, Random Undersampling) and a Synthetic Negative Approach based on CT-GAN (plus One-Class SVM) to address unreliable absences. Results show that SMOTE fails to resolve class overlap and the unreliable negatives, while Random Forest with CT-GAN-based synthetic negatives delivers the strongest performance; PCA visualizations reveal improved class separation. Wind direction and seasonal SST (notably February) emerge as key drivers, and the framework provides a practical path toward beaching risk mitigation despite data limitations such as missing population dynamics and life-cycle information.

Abstract

Bluebottles (Physalia spp.) are marine stingers resembling jellyfish, whose presence on Australian beaches poses a significant public risk because they are venomous. Understanding the environmental factors that drive bluebottles ashore is crucial for mitigating their impact, yet machine learning tools remain relatively unexplored for this problem. We use bluebottle presence/absence data from beaches in Eastern Sydney, Australia, and compare machine learning models (Multilayer Perceptron, Random Forest, and XGBoost) to identify factors influencing their presence. We address class imbalance, class overlap, and unreliable absence data with data augmentation techniques, including the Synthetic Minority Oversampling Technique (SMOTE), Random Undersampling, and a Synthetic Negative Approach that excludes the negative class. Our results show that SMOTE failed to resolve class overlap, whereas the presence-focused approach effectively handled imbalance, class overlap, and ambiguous absence data. Wind direction, a circular variable, emerged as a key factor influencing bluebottle presence, confirming previous inference studies. In the absence of data on population dynamics, biological behaviours, and life cycles, the best predictive model appears to be Random Forest combined with the Synthetic Negative Approach. This research contributes to mitigating the risks posed by bluebottles to beachgoers and provides insights into handling class overlap and an unreliable negative class in environmental modelling.
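Because the abstract singles out wind direction as a circular variable, a brief illustration of why that matters may help. The sketch below is not taken from the paper: it shows one common way to handle a direction in degrees (a sine/cosine encoding), and the function names `encode_direction` and `angular_distance` are hypothetical. The point is only that 359° and 1° should be treated as neighbours rather than as values 358 units apart.

```python
import math


def encode_direction(degrees: float) -> tuple[float, float]:
    """Map a direction in degrees to a (sin, cos) pair on the unit circle.

    Assumption: this encoding is illustrative; the paper does not state
    how it represents wind or current direction.
    """
    radians = math.radians(degrees % 360.0)
    return math.sin(radians), math.cos(radians)


def angular_distance(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two directions, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


# Directions on either side of north are far apart as raw numbers
# but close once encoded on the circle.
s1, c1 = encode_direction(359.0)
s2, c2 = encode_direction(1.0)
encoded_gap = math.hypot(s1 - s2, c1 - c2)
raw_gap = abs(359.0 - 1.0)
```

With this encoding a model receives two bounded features per direction instead of one value with an artificial jump at 0°/360°.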

Paper Structure (32 sections, 2 equations, 14 figures, 9 tables)

This paper contains 32 sections, 2 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: Map showing the spatial distribution of data points within the Randwick council area in eastern Sydney, Australia. Each blue dot represents a beach, plotted from its latitude and longitude coordinates using geospatial data.
  • Figure 2: Bluebottle marine stinger (Physalia physalis) in different scenarios and morphological variations, obtained from the iNaturalist website bluebottles_australia. The top-left panel shows a bluebottle floating in water, illustrating its natural habitat. The top-right panel displays a beaching event, with multiple stranded bluebottles onshore. The bottom-left panel highlights a left-handed bluebottle, distinguished by the orientation of its sail. The bottom-right panel depicts a right-handed bluebottle, showcasing the opposite sail orientation.
  • Figure 3: The neural network architecture using the entire feature set versus the subgroup features, where wind direction and current direction share similar sub-features, while SST, wind speed, and current speed share similar sub-features.
  • Figure 4: A framework for bluebottle modelling incorporating data preprocessing, augmentation techniques, and model validation. The framework consists of six key steps: (1) data acquisition, (2) data preprocessing, including exploratory analysis and feature selection, (3) baseline modelling without augmentation, (4) augmentation using both the class imbalance approach and the synthetic negative approach, (5) model training and validation, and (6) dimensionality reduction using PCA to visualise the different approaches implemented.
  • Figure 5: Distribution of the continuous features for bluebottle presence. SST exhibits an approximately bell-shaped curve, current speed and wind speed are right-skewed, and current direction and wind direction display bimodal distributions.
  • ...and 9 more figures
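
Step (4) of the framework in Figure 4 covers class-imbalance handling, and Random Undersampling is one of the techniques the abstract names for it. The following is a minimal sketch of that general technique in plain Python, not the authors' implementation: the function name `random_undersample`, the fixed seed, and the toy data are all assumptions made for illustration.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until all classes match the
    size of the smallest class.

    Assumption: a generic sketch of random undersampling; the paper's
    own sampling ratio and tooling are not specified here.
    """
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is reduced to the size of the smallest one.
    target = min(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        kept = rng.sample(members, target)  # sampling without replacement
        out_rows.extend(kept)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data: 2 presence records (label 1) and 8 absence records (label 0).
toy_rows = list(range(10))
toy_labels = [1, 1] + [0] * 8
balanced_rows, balanced_labels = random_undersample(toy_rows, toy_labels)
```

Undersampling discards majority-class records, so it balances the classes at the cost of information. It also does nothing about absence labels that are themselves unreliable, which is the gap the Synthetic Negative Approach is meant to address.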