A Density-Guided Temporal Attention Transformer for Indiscernible Object Counting in Underwater Video

Cheng-Yen Yang, Hsiang-Wei Huang, Zhongyu Jiang, Hao Wang, Farron Wallace, Jenq-Neng Hwang

TL;DR

This work tackles indiscernible object counting (IOC) in underwater videos by introducing the YoutubeFish-35 dataset and a new architecture, TransVidCount, which fuses density-map supervision with a temporal attention-based transformer. The model uses a Density Map Module to generate density maps and a Temporal Density-Guided Transformer to integrate density information across frames via density-guided queries; it is trained with a combined loss that includes both density and point-regression terms. On YoutubeFish-35, TransVidCount achieves state-of-the-art counting accuracy, outperforming MAN, CLTR, and IOCFormer, with the frame-5 configurations yielding the best results (e.g., MAE around 13.714 and NAE around 0.394). The work provides a new benchmark for IOC in underwater scenes and demonstrates the effectiveness of temporal density cues for counting occluded objects, while also outlining practical considerations such as latency and cropping-related ambiguities for future work.
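To make the reported metrics concrete, here is a minimal sketch of how MAE and NAE are commonly computed for counting tasks, assuming they stand for mean absolute error and normalized absolute error as is usual in the counting literature. The function name `counting_errors`, the sample counts, and the choice to skip frames with a zero ground-truth count are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence, Tuple


def counting_errors(pred: Sequence[float], gt: Sequence[float]) -> Tuple[float, float]:
    """Return (MAE, NAE) for per-frame predicted and ground-truth counts.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")

    # Absolute counting error for each frame.
    abs_err = [abs(p - g) for p, g in zip(pred, gt)]

    # MAE: average absolute error over all frames.
    mae = sum(abs_err) / len(abs_err)

    # NAE: absolute error divided by the ground-truth count, averaged.
    # Assumption: frames with a zero ground-truth count are skipped
    # to avoid division by zero.
    rel_err = [e / g for e, g in zip(abs_err, gt) if g > 0]
    nae = sum(rel_err) / len(rel_err) if rel_err else float("nan")

    return mae, nae


if __name__ == "__main__":
    # Made-up counts for three frames, purely for illustration.
    mae, nae = counting_errors([12, 30, 45], [10, 40, 50])
    print(f"MAE={mae:.3f}, NAE={nae:.3f}")
```

Because NAE normalizes each error by the true count, it is less dominated by very crowded frames than MAE, which is one reason the two are usually reported together.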

Abstract

Dense object counting, or crowd counting, has come a long way thanks to recent developments in the vision community. However, indiscernible object counting, which aims to count targets that blend into their surroundings, remains a challenge. Most publicly available object counting datasets are image-based. We therefore propose a large-scale video dataset called YoutubeFish-35, which contains 35 high-definition, high-frame-rate video sequences with more than 150,000 annotated center points across a selected variety of scenes. For benchmarking, we select three mainstream dense object counting methods and carefully evaluate them on the newly collected dataset. We also propose TransVidCount, a new strong baseline that combines density and regression branches along the temporal domain in a unified framework and effectively tackles indiscernible object counting, achieving state-of-the-art performance on the YoutubeFish-35 dataset.

Paper Structure (11 sections, 6 equations, 3 figures, 3 tables)

This paper contains 11 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Visualizations of the dataset: YoutubeFish-35, the first video-based dataset for indiscernible object counting with point-level annotations.
  • Figure 2: The overview of our proposed method: TransVidCount.
  • Figure 3: Visualization of the count estimation of CLTR [liang2022cltr], IOCFormer [sun2023iocfish], and TransVidCount. The first row contains the input images and the corresponding ground-truth counts, while the remaining rows show the predicted counts and their coordinates.