Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos

Yuang Feng, Shuyong Gao, Fuzhen Yan, Yicheng Song, Lingyi Hong, Junjie Hu, Wenqiang Zhang

TL;DR

This work tackles Video Camouflaged Object Detection (VCOD) by introducing SRRNet, an end-to-end framework that exploits dynamic video information through memory reference frames. It combines a Reference-Guided Multilevel Asymmetric (RMA) Attention mechanism with a Dual-Purpose Decoder that jointly outputs segmentation masks and frame-wise scores, which are used to select reference frames on the fly and enable single-pass processing. The RMA Transformer backbone fuses long-term reference information with short-term motion cues across four stages. Training uses BCE for segmentation and MSE-based score supervision, so feature extraction benefits from the auxiliary signal without external post-processing. Empirical results on MoCA-Mask and CAD show SRRNet achieving significant gains over state-of-the-art methods (approximately 10% on MoCA-Mask metrics) with 54M parameters.
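The two training signals mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `score_weight` parameter, the unweighted default sum, and the choice of the per-pixel absolute error as the target of the score branch are all assumptions made for illustration.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)

def mse_loss(pred, target):
    """Mean squared error over flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def dual_purpose_loss(pred_mask, pred_error, gt_mask, score_weight=1.0):
    """Segmentation BCE plus MSE supervision on a predicted error map.

    Assumption: the score branch is trained to regress the per-pixel
    absolute error of the predicted mask. `score_weight` is hypothetical.
    """
    true_error = [abs(p - t) for p, t in zip(pred_mask, gt_mask)]
    seg_term = bce_loss(pred_mask, gt_mask)
    score_term = mse_loss(pred_error, true_error)
    return seg_term + score_weight * score_term

# Toy usage: two pixels, a good mask, and an error map that matches the true error.
loss = dual_purpose_loss([0.9, 0.1], [0.1, 0.1], [1, 0])
```

In this toy case the score term is essentially zero, so the total is dominated by the segmentation term.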

Abstract

Video Camouflaged Object Detection (VCOD) aims to segment objects whose appearances closely resemble their surroundings, posing a challenging and emerging task. Existing vision models often struggle in such scenarios due to the indistinguishable appearance of camouflaged objects and the insufficient exploitation of dynamic information in videos. To address these challenges, we propose an end-to-end VCOD framework inspired by human memory-recognition, which leverages historical video information by integrating memory reference frames for camouflaged sequence processing. Specifically, we design a dual-purpose decoder that simultaneously generates predicted masks and scores, enabling reference frame selection based on scores while introducing auxiliary supervision to enhance feature extraction. Furthermore, this study introduces a novel reference-guided multilevel asymmetric attention mechanism, effectively integrating long-term reference information with short-term motion cues for comprehensive feature extraction. By combining these modules, we develop the Scoring, Remember, and Reference (SRR) framework, which efficiently extracts information to locate targets and employs memory guidance to improve subsequent processing. With its optimized module design and effective utilization of video data, our model achieves significant performance improvements, surpassing existing approaches by 10% on benchmark datasets while requiring fewer parameters (54M) and only a single pass through the video. The code will be made publicly available.
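The score-then-remember idea described in the abstract can be sketched as a single loop over the video. The sketch below is hypothetical: `segment_fn` stands in for the whole network, and the update rule (keep the frame with the lowest predicted error seen so far as the reference) is an assumption, not a rule confirmed by the text.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Frame = Sequence[float]
Mask = List[float]
Reference = Optional[Tuple[Frame, Mask]]

def run_single_pass(
    frames: Sequence[Frame],
    segment_fn: Callable[[Frame, Reference], Tuple[Mask, float]],
) -> Tuple[List[Mask], Reference]:
    """Segment frames in order, updating a memory reference as scores improve.

    `segment_fn(frame, reference)` is a hypothetical stand-in that returns
    (mask, predicted_error). Lower predicted error is assumed to be better.
    """
    reference: Reference = None      # no memory before the first frame
    best_error = float("inf")
    outputs: List[Mask] = []
    for frame in frames:
        mask, predicted_error = segment_fn(frame, reference)
        outputs.append(mask)
        # Remember the frame the model itself scores as most reliable.
        if predicted_error < best_error:
            best_error = predicted_error
            reference = (frame, mask)
    return outputs, reference

def toy_segment_fn(frame: Frame, reference: Reference) -> Tuple[Mask, float]:
    """Toy model: threshold the pixels, report the first pixel as the 'error'."""
    mask = [1.0 if value > 0.5 else 0.0 for value in frame]
    return mask, frame[0]

masks, final_reference = run_single_pass(
    [[0.8, 0.2], [0.3, 0.9], [0.6, 0.6]], toy_segment_fn
)
```

Each frame is visited once, which is the sense in which a score produced alongside the mask allows reference selection without a second pass over the video.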

Paper Structure

This paper contains 16 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison of methods with and without reference frame utilization. Bottom: Conventional video models that do not save reference frames. Top: Our method saves optimal frames as references to guide subsequent segmentation. The line graph illustrates the performance curves of the two methods on a sample video, with bold points highlighting frames visualized.
  • Figure 2: Overview of SRRNet. Using the proposed RMA attention mechanism, the framework constructs four stages to progressively extract multilevel features. These features are then fed into the dual-purpose decoder to generate segmentation results and predicted scores, which are subsequently used to update the reference frame.
  • Figure 3: The RMA attention mechanism consists of self-attention and cross-attention. The cross-attention operates asymmetrically across the three branches.
  • Figure 4: Dual-purpose decoder. The decoder outputs both a segmentation map and a predicted error map.
  • Figure 5: Comparison between true error and predicted error. In each pair of images, the right image represents the predicted error, while the left image shows the segmentation results, where red areas indicate false negatives, blue areas represent false positives, and white areas denote correctly predicted regions. The line graph compares the predicted MAE values with the actual MAE values, demonstrating that our prediction module effectively captures the trend of error variations between frames.
  • ...and 1 more figure
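The quantities referred to in the Figure 5 caption (false negatives, false positives, and the actual MAE of a frame) can be computed as in the following sketch. The 0.5 binarization threshold and the function names are assumptions chosen for illustration, and masks are represented as flat lists of values in [0, 1].

```python
from typing import List, Sequence

def error_categories(
    pred_mask: Sequence[float], gt_mask: Sequence[float], threshold: float = 0.5
) -> List[str]:
    """Label each pixel as 'false_negative', 'false_positive', or 'correct'.

    Assumption: predictions are binarized at `threshold` before comparison.
    """
    labels = []
    for p, g in zip(pred_mask, gt_mask):
        predicted_fg = p >= threshold
        actual_fg = g >= 0.5
        if actual_fg and not predicted_fg:
            labels.append("false_negative")   # red in the figure
        elif predicted_fg and not actual_fg:
            labels.append("false_positive")   # blue in the figure
        else:
            labels.append("correct")          # white in the figure
    return labels

def mean_absolute_error(pred_mask: Sequence[float], gt_mask: Sequence[float]) -> float:
    """Actual MAE of a frame: mean per-pixel absolute difference."""
    return sum(abs(p - g) for p, g in zip(pred_mask, gt_mask)) / len(pred_mask)

# Toy frame with four pixels.
labels = error_categories([0.9, 0.2, 0.7, 0.1], [1, 1, 0, 0])
mae = mean_absolute_error([0.9, 0.2, 0.7, 0.1], [1, 1, 0, 0])
```

A predicted frame-level score could then be compared against `mae` across frames, which is the kind of trend comparison the line graph in Figure 5 describes.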