FMRFT: Fusion Mamba and DETR for Query Time Sequence Intersection Fish Tracking
Mingyuan Yao, Yukang Huo, Qingbin Tian, Jiayin Zhao, Xiao Liu, Ruifeng Wang, Lin Xue, Haihua Wang
TL;DR
This work targets real-time multi-fish tracking in aquaculture settings, where occlusion and high visual similarity between fish degrade tracking performance. It presents FMRFT, a real-time end-to-end framework that combines a Fusion MIM feature-fusion module with RT-DETR and adds two components: a training-time Query Time Sequence Intersection (QTSI) mechanism and the Mamba-based Mamba Query Interaction Module (MQIM), which together manage temporal memory and query interactions. Fusion MIM provides robust multi-scale feature fusion, QTSI reduces redundant tracking frames, and MQIM propagates temporal context across frames. On a new sturgeon tracking dataset, the model achieves $IDF_1=90.3\%$ and $MOTA=94.3\%$ while maintaining real-time performance. Ablation studies confirm that Fusion MIM, QTSI, and MQIM each improve robustness to occlusion, glare, and inter-fish similarity, both individually and in combination, which suggests practical value for surveillance and management in factory aquaculture.
Abstract
Tracking fish with deep learning techniques enables early detection of abnormal behavior caused by disease or hunger, which holds significant value for industrial aquaculture. However, underwater reflections and characteristics of the fish themselves, such as high visual similarity, rapid swimming in response to stimuli, and mutual occlusion, make multi-target fish tracking challenging. To address these challenges, this paper establishes a complex multi-scenario sturgeon tracking dataset and introduces the FMRFT model, a real-time end-to-end fish tracking solution. The model incorporates the Mamba In Mamba (MIM) architecture, which has low video memory consumption and supports multi-frame temporal memory and feature extraction, thereby easing the tracking of multiple fish across frames. Additionally, the Query Time Sequence Intersection (QTSI) module, drawing on the feature interaction and prior-frame processing capabilities of RT-DETR, handles occluded objects and reduces redundant tracking frames. This combination improves both the accuracy and the stability of fish tracking. Trained and tested on this dataset, the model achieves an IDF1 score of 90.3% and a MOTA of 94.3%. Experimental results show that the proposed FMRFT model effectively addresses the challenges of high similarity and mutual occlusion in fish populations, enabling accurate tracking in factory farming environments.
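For readers unfamiliar with the two reported metrics, the following minimal sketch shows how MOTA and IDF1 are conventionally computed from aggregate error counts. It uses the standard definitions, not code from this paper; the function names and the counts in the usage example are hypothetical and chosen only for illustration.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy (standard definition).

    MOTA = 1 - (FN + FP + IDSW) / GT, where GT is the total number of
    ground-truth objects summed over all frames.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """ID F1 score (standard definition).

    IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN), where IDTP, IDFP, and IDFN
    are identity-level true positives, false positives, and false negatives.
    """
    denominator = 2 * idtp + idfp + idfn
    if denominator == 0:
        raise ValueError("at least one count must be non-zero")
    return 2 * idtp / denominator


if __name__ == "__main__":
    # Hypothetical counts for illustration only; not results from the paper.
    print(f"MOTA = {mota(50, 30, 20, 1000):.3f}")
    print(f"IDF1 = {idf1(80, 10, 30):.3f}")
```

MOTA penalizes missed detections, false detections, and identity switches together, whereas IDF1 measures how consistently each fish keeps the same identity over time. That is why both are reported for a setting dominated by occlusion and look-alike targets.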
