RGB-D Indiscernible Object Counting in Underwater Scenes

Guolei Sun, Xiaogang Cheng, Zhaochong An, Xiaokang Wang, Yun Liu, Deng-Ping Fan, Ming-Ming Cheng, Luc Van Gool

TL;DR

This work defines indiscernible object counting (IOC) and introduces IOCfish5K, a large-scale underwater IOC dataset with 659,024 center-point annotations, together with IOCfish5K-D, which adds high-quality pseudo-depth maps. It proposes IOCFormer, a two-branch architecture that fuses density cues with regression through a density-enhanced transformer encoder, and extends it to IOCFormer-D, which exploits RGB-D information through the MDFM and CAMF modules. Experiments show that IOCFormer achieves state-of-the-art results on IOCfish5K and that IOCFormer-D does so on the RGB-D IOCfish5K-D benchmark, highlighting the value of depth for counting camouflaged objects. The datasets and models provide a foundation for IOC research and for possible transfer to other indiscernible counting problems in challenging scenes.

Abstract

Recently, indiscernible/camouflaged scene understanding has attracted considerable research attention in the vision community. We further advance the frontier of this field by systematically studying a new challenge named indiscernible object counting (IOC), the goal of which is to count objects that blend into their surroundings. Due to a lack of appropriate IOC datasets, we present a large-scale dataset, IOCfish5K, which contains a total of 5,637 high-resolution images and 659,024 annotated center points. Our dataset consists of a large number of indiscernible objects (mainly fish) in underwater scenes, making the annotation process all the more challenging. IOCfish5K is superior to existing datasets with indiscernible scenes because of its larger scale, higher image resolutions, more annotations, and denser scenes. All these aspects make it the most challenging dataset for IOC so far, supporting progress in this area. Building on recent advances in depth estimation foundation models, we construct high-quality depth maps for IOCfish5K by generating pseudo labels with the Depth Anything V2 model. The RGB-D version of IOCfish5K is named IOCfish5K-D. For benchmarking on IOCfish5K, we select 14 mainstream object counting methods and carefully evaluate them. For the multimodal IOCfish5K-D, we evaluate four other popular multimodal counting methods. Furthermore, we propose IOCFormer, a new strong baseline that combines density and regression branches in a unified framework and can effectively tackle object counting in concealed scenes. We also propose IOCFormer-D, which makes effective use of the depth modality to help detect and count objects hidden in their environments. Experiments show that IOCFormer and IOCFormer-D achieve state-of-the-art scores on IOCfish5K and IOCfish5K-D, respectively.
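The abstract mentions center-point annotations and a density branch. Density-based counters are commonly supervised with a density map built from such points, whose sum equals the object count. The sketch below illustrates that generic technique only; it is not taken from the paper, and the function names, the fixed `sigma`, and the per-point renormalisation at image borders are all assumptions made for illustration.

```python
import math

def density_map(points, height, width, sigma=2.0):
    """Build a density map from (x, y) center points (hypothetical helper).

    Each point is spread as a Gaussian that is renormalised after clipping
    to the image, so every annotated point contributes exactly 1 to the sum.
    """
    radius = int(3 * sigma)
    dmap = [[0.0] * width for _ in range(height)]
    for px, py in points:
        cx, cy = int(round(px)), int(round(py))
        cells = []
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                dist2 = (x - cx) ** 2 + (y - cy) ** 2
                cells.append((y, x, math.exp(-dist2 / (2 * sigma ** 2))))
        if not cells:  # point lies too far outside the image to contribute
            continue
        total = sum(w for _, _, w in cells)
        for y, x, w in cells:
            dmap[y][x] += w / total
    return dmap

def count_from_density(dmap):
    """Recover the count by summing the density map."""
    return sum(sum(row) for row in dmap)

# Toy example with three annotated centers, one of them at the image corner.
toy_points = [(3, 4), (10, 10), (0, 0)]
toy_map = density_map(toy_points, height=16, width=16)
toy_count = count_from_density(toy_map)
```

Summing the map recovers the number of annotated points up to floating-point error, which is the property a density branch would be trained to reproduce.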

Paper Structure (25 sections, 10 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Illustration of different counting tasks. Top left: Generic Object Counting (GOC), which counts objects of various classes in natural scenes. Top right: Dense Object Counting (DOC), which counts objects of a foreground class in scenes packed with instances. Bottom: Indiscernible Object Counting (IOC), which counts objects of a foreground class in indiscernible scenes. Can you find all the fish in the given examples? For GOC, DOC, and IOC, the images shown are from PASCAL VOC [everingham2015pascal], ShanghaiTech [zhang2016single], and the new IOCfish5K dataset, respectively.
  • Figure 2: Example images from the proposed IOCfish5K. From left column to right column: typical samples, indiscernible & dense samples, indiscernible & less dense samples, less indiscernible & dense samples, less indiscernible & less dense samples.
  • Figure 3: Example images and corresponding depth maps from the proposed IOCfish5K-D. Each depth map is of high quality and contains detailed distance information about the scene. Best viewed with zooming.
  • Figure 4: Image distributions under different density (count) ranges ($<50$, 51 to 100, 101 to 200, and $>200$) in training, validation (val), and test sets of IOCfish5K.
  • Figure 5: Overview of the proposed IOCFormer. Given an input image, we extract a feature map using an RGB encoder, which is then processed by a density branch and a regression branch. The density-enhanced transformer encoder exploits the object density information from the density branch to generate more relevant features for the regression. Refer to §\ref{sec:Baseline} for more details.
  • ...and 4 more figures
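
The count ranges named in the Figure 4 caption can be turned into a simple per-split histogram. The following is a minimal sketch, assuming per-image counts are available as plain integers; the function names and the sample counts are hypothetical, and placing a count of exactly 50 in the first bin is an assumption, since the caption lists "<50" and "51 to 100" without saying where 50 falls.

```python
from collections import Counter

# Bin labels follow the ranges in the Figure 4 caption.
BIN_ORDER = ["0-50", "51-100", "101-200", ">200"]

def density_bin(count):
    """Map a per-image object count to a density range (hypothetical helper)."""
    if count < 0:
        raise ValueError("object count must be non-negative")
    if count <= 50:  # assumption: a count of exactly 50 goes to the first bin
        return "0-50"
    if count <= 100:
        return "51-100"
    if count <= 200:
        return "101-200"
    return ">200"

def density_histogram(counts):
    """Return the number of images per density range, in caption order."""
    tally = Counter(density_bin(c) for c in counts)
    return {label: tally.get(label, 0) for label in BIN_ORDER}

# Made-up counts for illustration only; these are not dataset statistics.
sample_counts = [12, 50, 51, 100, 101, 200, 201, 2371]
sample_hist = density_histogram(sample_counts)
```

Running the same function over the training, validation, and test splits separately would give the three distributions that the figure compares.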