A Machine Learning Framework for Handling Unreliable Absence Label and Class Imbalance for Marine Stinger Beaching Prediction
Amuche Ibenegbu, Amandine Schaeffer, Pierre Lafaye de Micheaux, Rohitash Chandra
TL;DR
This study tackles daily bluebottle beaching prediction in Eastern Sydney under unreliable negative labels and severe class imbalance. It introduces an ML framework that combines MLP, Random Forest, and XGBoost with data augmentation (SMOTE, Random Undersampling) and a Synthetic Negative Approach, which uses CT-GAN (plus One-Class SVM) to address unreliable absences. Results show that SMOTE fails to resolve class overlap and unreliable negatives, while Random Forest with CT-GAN-based synthetic negatives delivers the strongest performance; PCA visualizations reveal improved class separation. Wind direction and seasonal SST (notably February) emerge as key drivers, and the framework provides a practical path toward beaching risk mitigation despite data limitations such as missing population dynamics and life-cycle information.
Abstract
Bluebottles (\textit{Physalia} spp.) are marine stingers resembling jellyfish, whose presence on Australian beaches poses a significant public risk due to their venomous nature. Understanding the environmental factors that drive bluebottles ashore is crucial for mitigating their impact, yet machine learning tools remain relatively unexplored for this purpose. We use bluebottle presence/absence data from beaches in Eastern Sydney, Australia, and compare machine learning models (Multilayer Perceptron, Random Forest, and XGBoost) to identify factors influencing their presence. We address class imbalance, class overlap, and unreliable absence data by employing data augmentation techniques, including the Synthetic Minority Oversampling Technique (SMOTE), Random Undersampling, and a Synthetic Negative Approach that excludes the negative class. Our results show that SMOTE failed to resolve class overlap, whereas the presence-focused approach effectively handled imbalance, class overlap, and ambiguous absence data. Data attributes such as wind direction, which is a circular variable, emerged as key factors influencing bluebottle presence, confirming previous inference studies. In the absence of data on population dynamics, biological behaviours, and life cycles, the best predictive model appears to be Random Forest combined with the Synthetic Negative Approach. This research contributes to mitigating the risks posed by bluebottles to beachgoers and provides insights into handling class overlap and an unreliable negative class in environmental modelling.
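Because wind direction is a circular variable, bearings of 359 and 1 degrees are physically close even though their raw values are far apart. One common way to expose this to a model is to map each bearing to a sine and cosine pair. The following is a minimal sketch of that idea, not the encoding used in this study, which the abstract does not specify; the function names `encode_wind_direction` and `circular_distance` are hypothetical and introduced only for illustration.

```python
import math


def encode_wind_direction(degrees):
    """Map a compass bearing in degrees to (sin, cos) on the unit circle.

    Hypothetical helper for illustration; not taken from the paper.
    """
    radians = math.radians(degrees % 360.0)
    return math.sin(radians), math.cos(radians)


def circular_distance(a_deg, b_deg):
    """Smallest angular separation between two bearings, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


# Bearings 359 and 1 differ by 358 as raw numbers,
# but they are only 2 degrees apart on the circle.
s1, c1 = encode_wind_direction(359.0)
s2, c2 = encode_wind_direction(1.0)

# Euclidean gap between the two encoded points; small when bearings are close.
encoded_gap = math.hypot(s1 - s2, c1 - c2)
```

With this encoding, the two (sin, cos) features can be passed to a Multilayer Perceptron, Random Forest, or XGBoost model in place of the raw bearing, so that nearby directions receive nearby feature values across the 0/360 boundary.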
