Depth-Assisted Network for Indiscernible Marine Object Counting with Adaptive Motion-Differentiated Feature Encoding

Chengzhi Ma, Kunqian Li, Shuaixin Liu, Han Mei

TL;DR

The paper tackles counting indiscernible marine objects in underwater videos with VIMOC-Net, a depth-assisted network that combines a depth-enhanced encoder with adaptive motion-differentiated feature encoding. It integrates depth-aware representations, multi-scale flow estimation, and a density-conservation constraint to produce density maps and counts, and is trained with a combination of flow, cycle-consistency, and depth supervision. A new dataset, VIMOC, with 50 videos and about 40,800 annotated points, supports the evaluation: VIMOC-Net achieves state-of-the-art results on it (e.g., MAE $7.80$, RMSE $11.21$ on the full test set) and is competitive on three additional video-based crowd counting datasets. The work also considers efficiency and discusses future directions, including semantic cues to further improve performance in complex underwater environments.
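For readers unfamiliar with density-map counting, the metrics quoted above can be illustrated with a minimal sketch. It assumes the common convention that a frame's predicted count is the sum of its density map, and that MAE and RMSE are computed over per-frame counts. The function names and the toy inputs are hypothetical and are not taken from the VIMOC-Net code.

```python
import numpy as np


def points_to_density(points, shape):
    """Build a toy density map with unit mass at each annotated (row, col) point.

    Assumption: real pipelines usually smooth these impulses with a Gaussian
    kernel; smoothing is omitted here because it does not change the sum.
    """
    density = np.zeros(shape, dtype=np.float64)
    for row, col in points:
        density[row, col] += 1.0
    return density


def count_from_density(density_map):
    """Estimated object count is the integral (sum) of the density map."""
    return float(np.sum(density_map))


def mae_rmse(pred_counts, gt_counts):
    """Mean absolute error and root mean squared error over per-frame counts."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    err = pred - gt
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    return mae, rmse


# Toy usage: three annotated points in a 4x4 frame.
toy_density = points_to_density([(0, 1), (2, 2), (3, 0)], shape=(4, 4))
toy_count = count_from_density(toy_density)

# Toy usage: two frames with hypothetical predicted and ground-truth counts.
toy_mae, toy_rmse = mae_rmse([10.0, 20.0], [12.0, 17.0])
```

Because RMSE squares each error before averaging, it penalizes frames with large miscounts more heavily than MAE does, which is why the two numbers are typically reported together.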

Abstract

Indiscernible marine object counting encounters numerous challenges, including limited visibility in underwater scenes, mutual occlusion and overlap among objects, and the dynamic similarity in appearance, color, and texture between the background and foreground. These factors significantly complicate the counting process. To address the scarcity of video-based indiscernible object counting datasets, we have developed a novel dataset comprising 50 videos, from which approximately 800 frames have been extracted and annotated with around 40,800 point-wise object labels. This dataset accurately represents real underwater environments where indiscernible marine objects are intricately integrated with their surroundings, thereby comprehensively illustrating the aforementioned challenges in object counting. To address these challenges, we propose a depth-assisted network with adaptive motion-differentiated feature encoding. The network consists of a backbone encoding module and three branches: a depth-assisting branch, a density estimation branch, and a motion weight generation branch. Depth-aware features extracted by the depth-assisting branch are enhanced via a depth-enhanced encoder to improve object representation. Meanwhile, weights from the motion weight generation branch refine multi-scale perception features in the adaptive flow estimation module. Experimental results demonstrate that our method not only achieves state-of-the-art performance on the proposed dataset but also yields competitive results on three additional video-based crowd counting datasets. The pre-trained model, code, and dataset are publicly available at https://github.com/OUCVisionGroup/VIMOC-Net.

Paper Structure

This paper contains 29 sections, 12 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Examples from the proposed Video Indiscernible Marine Object Counting (VIMOC) dataset and the predictions of our VIMOC-Net. From top to bottom: sampled frames, the corresponding point annotations of indiscernible objects, the depth maps estimated with Depth Anything [yang2024depth], and our predictions.
  • Figure 2: The framework of the proposed VIMOC-Net, which is a depth-assisted indiscernible marine object counting network with adaptive motion-differentiated feature encoding. The network input consists of the previous frame $I_{t-1}$ and current frame $I_t$. After feature extraction by a shared encoder, the features from both frames are concatenated into a feature map $F$, which is then fed into the depth-assisting branch and density estimation branch. The depth-enhanced encoder leverages depth-aware features $F_d$ to improve indiscernible object features. The adaptive flow estimation module applies motion weights $w_i$ to estimate flow adaptively on multi-scale perception features.
  • Figure 3: Detailed structure of the Depth-Enhanced Encoder (DEE).
  • Figure 4: Detailed structures of the Adaptive Flow Estimation Module (AFEM) and the Motion Weight Generation (MWG). The AFEM is composed of the motion enhancement module and the adaptive flow feature fusion.
  • Figure 5: The proportion of indiscernible objects within specific number ranges across the entire video dataset. A representative sample from each number range is selected for display, with the corresponding count labeled at the lower left corner of each sample.
  • ...and 8 more figures