MSRNet: A Multi-Scale Recursive Network for Camouflaged Object Detection

Leena Alghamdi, Muhammad Usman, Hafeez Anwar, Abdul Bais, Saeed Anwar

TL;DR

MSRNet tackles camouflaged object detection by combining a Pyramid Vision Transformer encoder for multi-scale feature extraction with Attention-Based Scale Integration Units and a recursive-feedback decoder featuring Multi-Granularity Fusion Units. It processes an image pyramid, fuses scale-specific features, and preserves global context through recursive feedback to detect small and multiple camouflaged objects. The training objective blends BCE with an Uncertainty Awareness Loss to encourage confident predictions. On four COD datasets, MSRNet achieves state-of-the-art results on two benchmarks and ranks second on two others, albeit with higher computational cost due to multi-scale processing. Future work aims to optimize efficiency and extend COD to video domains while maintaining strong detection performance.
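As a rough illustration of how a BCE term and an uncertainty-aware term might be blended, the sketch below works on flat lists of per-pixel probabilities. The summary does not give the exact Uncertainty Awareness Loss formula or its weighting, so the quadratic form 1 - |2p - 1|^2 and the `ual_weight` parameter are assumptions, and all function names are hypothetical.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def ual(p):
    """Assumed uncertainty penalty: largest at p = 0.5, zero at p = 0 or 1."""
    return 1.0 - abs(2.0 * p - 1.0) ** 2


def combined_loss(probs, targets, ual_weight=1.0):
    """Mean BCE plus a weighted mean uncertainty penalty (weight is assumed)."""
    n = len(probs)
    bce_term = sum(bce(p, y) for p, y in zip(probs, targets)) / n
    ual_term = sum(ual(p) for p in probs) / n
    return bce_term + ual_weight * ual_term


if __name__ == "__main__":
    targets = [1.0, 0.0]
    print(combined_loss([0.9, 0.1], targets))  # confident and correct
    print(combined_loss([0.6, 0.4], targets))  # correct but hesitant
```

Under this assumed form, the penalty depends only on the prediction, not the label, so it pushes ambiguous pixels away from 0.5 while the BCE term decides in which direction.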

Abstract

Camouflaged object detection is an emerging and challenging computer vision task that requires identifying and segmenting objects that blend seamlessly into their environments due to high similarity in color, texture, and size. This task is further complicated by low-light conditions, partial occlusion, small object size, intricate background patterns, and multiple objects. While many sophisticated methods have been proposed for this task, current methods still struggle to precisely detect camouflaged objects in complex scenarios, especially with small and multiple objects, indicating room for improvement. We propose a Multi-Scale Recursive Network that extracts multi-scale features via a Pyramid Vision Transformer backbone and combines them via specialized Attention-Based Scale Integration Units, enabling selective feature merging. For more precise object detection, our decoder recursively refines features by incorporating Multi-Granularity Fusion Units. A novel recursive-feedback decoding strategy is developed to enhance global context understanding, helping the model overcome the challenges in this task. By jointly leveraging multi-scale learning and recursive feature optimization, our proposed method achieves performance gains, successfully detecting small and multiple camouflaged objects. Our model achieves state-of-the-art results on two benchmark datasets for camouflaged object detection and ranks second on the remaining two. Our code, model weights, and results are available at https://github.com/linaagh98/MSRNet.

Paper Structure

This paper contains 14 sections, 3 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Some challenging camouflage scenarios, including: multiple objects (rows 1 and 2), small objects (row 3), and tiny objects (row 4).
  • Figure 2: The five decoding strategies in the literature: (a) The progressive decoding strategy, (b) The dense decoding strategy, (c) The feedback decoding strategy, (d) The separate decoding strategy, and (e) The pyramidal decoding strategy.
  • Figure 3: The overall architecture of MSRNet. Three scales of the original image are each fed into a PVT for feature extraction, generating four feature maps of different resolutions: $f_{1}$, $f_{2}$, $f_{3}$, and $f_{4}$. In the next stage, the feature maps of the same resolution across all scales are merged by the Attention-Based Scale Integration Unit (ABSIU). Each merged feature map is further refined inside the decoder using the Multi-Granularity Fusion Unit (MGFU). The Recursive-Feedback decoding strategy combines feedback from all lower resolutions with the current resolution being processed by the MGFU.
  • Figure 4: Feature Extraction Approach
  • Figure 5: The Attention-Based Scale Integration Unit (ABSIU) for multi-scale feature integration. Features from the three scales ($f^{1.0}$, $f^{1.5}$, $f^{2.0}$) are first aligned to a common resolution and concatenated. The attention mechanism then applies a series of convolutional layers followed by a Softmax activation layer to generate three-channel attention maps ($A^{1}_{i}$, $A^{2}_{i}$, $A^{3}_{i}$), where each channel corresponds to a different scale. The attention maps are multiplied element-wise ($\otimes$) with their corresponding feature maps ($F^{1}_{i}$, $F^{2}_{i}$, $F^{3}_{i}$), resulting in three scale-grouped processed feature maps that are then summed to produce multi-scale feature maps. This process is repeated for each attention group, yielding four groups of multi-scale features. Lastly, a summation across groups merges features from all attention groups, producing the final output $F^{ABSIU}$.
  • ...and 5 more figures
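
The scale-weighting step described in the Figure 5 caption can be sketched for a single attention group as follows. This is a simplified illustration, not the authors' implementation: the three feature maps are assumed to be already aligned to a common resolution, each is reduced to one channel, and the attention logits that the caption says come from convolutional layers are passed in directly. The names `softmax` and `fuse_scales` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(features, logits):
    """Fuse per-scale feature maps with per-pixel softmax attention.

    features: list of S maps (one per scale), each an H x W nested list.
    logits:   list of S attention-logit maps with the same shape
              (assumed given here; in the caption they come from conv layers).
    Returns one H x W map: sum over scales of attention * feature.
    """
    height = len(features[0])
    width = len(features[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Softmax across scales so the weights at each pixel sum to 1.
            weights = softmax([scale_logits[r][c] for scale_logits in logits])
            fused[r][c] = sum(w * f[r][c] for w, f in zip(weights, features))
    return fused


if __name__ == "__main__":
    feats = [[[1.0, 1.0]], [[2.0, 2.0]], [[3.0, 3.0]]]   # three scales, 1 x 2 maps
    logit = [[[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 9.0]]]   # second pixel favours scale 3
    print(fuse_scales(feats, logit))
```

With equal logits the fusion reduces to a plain average of the three scales; a dominant logit lets one scale take over at that pixel, which is the selective merging the caption describes. Repeating this per attention group and summing the results would correspond to the final cross-group summation.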