A Framework for Multi-View Multiple Object Tracking using Single-View Multi-Object Trackers on Fish Data
Chaim Chai Elchik, Fatemeh Karimi Nejadasl, Seyed Sahand Mohammadi Ziabari, Ali Mohammed Mansoor Alsahag
TL;DR
This work tackles multi-object tracking (MOT) for small, visually similar underwater fish by adapting state-of-the-art single-view trackers (FairMOT, YOLOv8) within a stereo multi-view framework to produce 3D outputs. It builds a pipeline that trains YOLOv8, tracks with ByteTrack, applies post-track re-identification (re-ID), and performs stereo matching to triangulate 3D coordinates, enabling richer behavioral analysis. Evaluation with standard MOT metrics (HOTA, DetA, AssA, MOTA, IDF1) shows strong precision but limited recall in the single-view setting, while the multi-view framework yields depth information for a subset of tracks and reduces identity fragmentation through re-ID. The results demonstrate the feasibility of combining single-view MOT components into a cross-view, 3D-aware tracking framework for underwater ecological studies, and point to clear directions for data, hardware, and methodological improvements to enhance generalization and robustness.
Abstract
Multi-object tracking (MOT) in computer vision has made significant advancements, yet tracking small fish in underwater environments presents unique challenges due to complex 3D motions and data noise. Traditional single-view MOT models often fall short in these settings. This thesis addresses these challenges by adapting state-of-the-art single-view MOT models, FairMOT and YOLOv8, for underwater fish detection and tracking in ecological studies. The core contribution of this research is the development of a multi-view framework that utilizes stereo video inputs to enhance tracking accuracy and fish behavior pattern recognition. By integrating and evaluating these models on underwater fish video datasets, the study aims to demonstrate significant improvements in precision and reliability compared to single-view approaches. The proposed framework detects fish entities with a relative accuracy of 47% and employs stereo-matching techniques to produce a novel 3D output, providing a more comprehensive understanding of fish movements and interactions.
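To make the stereo-matching step concrete, the following is a minimal sketch of how a matched pair of detections could be turned into a 3D coordinate. It is an illustration under stated assumptions, not the framework's actual implementation: it assumes a rectified stereo pair with an ideal pinhole model, ignores underwater refraction, and the function name and all parameters (`focal_px`, `baseline_m`, `cx`, `cy`) are hypothetical.

```python
from typing import NamedTuple


class Point3D(NamedTuple):
    x: float  # metres, to the right of the left camera centre
    y: float  # metres, below the left camera centre
    z: float  # metres, depth along the optical axis


def triangulate_rectified(u_left: float, v_left: float, u_right: float,
                          focal_px: float, baseline_m: float,
                          cx: float, cy: float) -> Point3D:
    """Back-project a matched detection centre from a rectified stereo pair.

    Assumptions (not taken from the paper): both images are rectified, share
    the same focal length in pixels and principal point (cx, cy), and the
    cameras are separated horizontally by baseline_m.
    """
    # In a rectified pair, a point appears further left in the right image.
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("non-positive disparity: the match is invalid or the point is at infinity")

    # Depth from similar triangles: Z = f * B / d.
    z = focal_px * baseline_m / disparity
    # Lateral and vertical offsets follow from the pinhole projection.
    x = (u_left - cx) * z / focal_px
    y = (v_left - cy) * z / focal_px
    return Point3D(x, y, z)


# Hypothetical example values: 800 px focal length, 10 cm baseline.
point = triangulate_rectified(u_left=700, v_left=400, u_right=660,
                              focal_px=800, baseline_m=0.1, cx=640, cy=360)
```

Applying such a function per frame to the centres of two bounding boxes that were matched across views would give one 3D position per matched track, which is consistent with depth being available only for the subset of tracks that find a cross-view match.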
