Depth-Assisted Network for Indiscernible Marine Object Counting with Adaptive Motion-Differentiated Feature Encoding
Chengzhi Ma, Kunqian Li, Shuaixin Liu, Han Mei
TL;DR
The paper tackles the counting of indiscernible marine objects in underwater videos by introducing VIMOC-Net, a depth-assisted network that combines a depth-enhanced encoder with adaptive motion-differentiated feature encoding. It integrates depth-aware representations, multi-scale flow estimation, and a density-conservation constraint to produce accurate density maps and counts, and is trained with a combination of flow, cycle-consistency, and depth supervision. A new dataset, VIMOC, with 50 videos and about 40,800 annotated points, supports the evaluation: VIMOC-Net achieves state-of-the-art results (e.g., MAE $7.80$, RMSE $11.21$ on the full test set) and generalizes well to other video counting benchmarks. The work also examines efficiency and discusses future directions, including the use of semantic cues to further improve performance in complex underwater environments.
Abstract
Indiscernible marine object counting encounters numerous challenges, including limited visibility in underwater scenes, mutual occlusion and overlap among objects, and the dynamic similarity in appearance, color, and texture between the background and foreground. These factors significantly complicate the counting process. To address the scarcity of video-based indiscernible object counting datasets, we have developed a novel dataset comprising 50 videos, from which approximately 800 frames have been extracted and annotated with around 40,800 point-wise object labels. This dataset accurately represents real underwater environments where indiscernible marine objects are intricately integrated with their surroundings, thereby comprehensively illustrating the aforementioned challenges in object counting. To address these challenges, we propose a depth-assisted network with adaptive motion-differentiated feature encoding. The network consists of a backbone encoding module and three branches: a depth-assisting branch, a density estimation branch, and a motion weight generation branch. Depth-aware features extracted by the depth-assisting branch are enhanced via a depth-enhanced encoder to improve object representation. Meanwhile, weights from the motion weight generation branch refine multi-scale perception features in the adaptive flow estimation module. Experimental results demonstrate that our method not only achieves state-of-the-art performance on the proposed dataset but also yields competitive results on three additional video-based crowd counting datasets. The pre-trained model, code, and dataset are publicly available at https://github.com/OUCVisionGroup/VIMOC-Net.
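For readers unfamiliar with density-map counting, the following minimal sketch illustrates how a count is typically read off a density map (by summing its values) and how the MAE and RMSE metrics quoted in the TL;DR are conventionally computed over a set of frames. The function names, the toy density maps, and the per-frame evaluation are illustrative assumptions, not the authors' evaluation code.

```python
import math


def count_from_density(density_map):
    """Predicted count for one frame: the sum of all density-map values."""
    return sum(sum(row) for row in density_map)


def mae_rmse(pred_counts, true_counts):
    """Mean absolute error and root-mean-square error over per-frame counts."""
    errors = [p - t for p, t in zip(pred_counts, true_counts)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Toy 2x2 density maps (hypothetical values, one map per frame).
density_maps = [
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.25, 0.25], [0.25, 0.25]],
    [[1.5, 0.5], [0.5, 0.5]],
]
true_counts = [3, 1, 5]  # hypothetical ground-truth point counts

pred_counts = [count_from_density(d) for d in density_maps]
mae, rmse = mae_rmse(pred_counts, true_counts)
print(pred_counts, mae, rmse)
```

Because RMSE squares each error before averaging, it penalizes occasional large miscounts more heavily than MAE, which is why the two are usually reported together.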
