Explicit Motion Handling and Interactive Prompting for Video Camouflaged Object Detection

Xin Zhang, Tao Xiao, Gepeng Ji, Xuan Wu, Keren Fu, Qijun Zhao

TL;DR

This work tackles the challenging problem of video camouflaged object detection by introducing EMIP, a two-stream framework that explicitly models motion through a frozen optical-flow backbone and inter-stream interactive prompting. By injecting segmentation-to-motion prompts via a camouflage feeder and motion-to-segmentation prompts via a motion collector, EMIP enhances both appearance-based segmentation and motion estimation, with a self-supervised flow loss guiding learning. A long-term variant, EMIP$^\dag$, incorporates historical information through a memory-augmented prompt, achieving robust temporal consistency and state-of-the-art results on MoCA-Mask and CAD, while also generalizing well to VSOD/VOS datasets. The approach demonstrates that controllable prompts across coupled vision tasks can significantly improve camouflaged-object detection in dynamic scenes, offering practical benefits for real-time video analysis and broader video segmentation tasks.

Abstract

Camouflage makes a static target hard to distinguish, whereas any movement of the target can break the disguise. Existing video camouflaged object detection (VCOD) approaches take noisy motion estimation as input or model motion implicitly, restricting detection performance in complex dynamic scenes. In this paper, we propose a novel Explicit Motion handling and Interactive Prompting framework for VCOD, dubbed EMIP, which handles motion cues explicitly using a frozen pre-trained optical flow fundamental model. EMIP is characterized by a two-stream architecture that simultaneously conducts camouflaged segmentation and optical flow estimation. Interactions across the two streams are realized through interactive prompting, inspired by emerging visual prompt learning. Two learnable modules, i.e., the camouflaged feeder and motion collector, are designed to incorporate segmentation-to-motion and motion-to-segmentation prompts, respectively, and enhance the outputs of both streams. The prompt fed to the motion stream is learned by supervising optical flow in a self-supervised manner. Furthermore, we show that long-term historical information can also be incorporated as a prompt into EMIP to achieve more robust results with temporal consistency. Experimental results demonstrate that our EMIP achieves new state-of-the-art results on popular VCOD benchmarks. Our code is made publicly available at https://github.com/zhangxin06/EMIP.
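The data flow described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the abstract does not specify the internals of the camouflaged feeder or motion collector, so the linear maps, additive prompts, feature sizes, and all names below (`frozen`, `learnable`, `forward`) are assumptions made purely for illustration. The only points taken from the source are that the motion backbone is frozen, that both prompt modules are learnable, and that prompts flow in both directions between the two streams.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 8    # feature channels (hypothetical size)
N = 16   # spatial tokens, e.g. a flattened 4x4 feature map (hypothetical size)

# Frozen parameters: a stand-in for the pre-trained optical-flow backbone.
# In training these would never be updated.
frozen = {"flow_block": rng.standard_normal((C, C)) / np.sqrt(C)}

# Learnable parameters: the two prompt modules plus the segmentation stream.
learnable = {
    "camouflaged_feeder": rng.standard_normal((C, C)) * 0.01,
    "motion_collector": rng.standard_normal((C, C)) * 0.01,
    "seg_block": rng.standard_normal((C, C)) / np.sqrt(C),
    "seg_head": rng.standard_normal((C, 1)) / np.sqrt(C),
}


def relu(x):
    return np.maximum(x, 0.0)


def forward(seg_feat, motion_feat):
    """One round of interactive prompting between the two streams.

    seg_feat, motion_feat: arrays of shape (N, C).
    Returns a per-token mask probability of shape (N, 1) and the
    prompted motion features of shape (N, C).
    """
    # Segmentation-to-motion prompt (assumed additive), injected before
    # the frozen flow block.
    seg_to_motion = seg_feat @ learnable["camouflaged_feeder"]
    motion_out = relu((motion_feat + seg_to_motion) @ frozen["flow_block"])

    # Motion-to-segmentation prompt (assumed additive), injected into the
    # learnable segmentation stream.
    motion_to_seg = motion_out @ learnable["motion_collector"]
    seg_out = relu((seg_feat + motion_to_seg) @ learnable["seg_block"])

    logits = seg_out @ learnable["seg_head"]
    mask = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    return mask, motion_out


# Usage with random stand-in features for one frame pair.
seg_feat = rng.standard_normal((N, C))
motion_feat = rng.standard_normal((N, C))
mask, motion_out = forward(seg_feat, motion_feat)
print(mask.shape, motion_out.shape)
```

Keeping the frozen and learnable weights in separate dictionaries makes the split explicit: an optimizer would only ever see `learnable`, so the motion stream can be steered through the prompt alone while the pre-trained flow weights stay untouched.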

Paper Structure (35 sections, 6 equations, 15 figures, 9 tables)

Figures (15)

  • Figure 1: Different strategies of motion handling in VCOD: (a) directly feeding optical flow maps [lamdouar2020betrayed, yang2021selfsupervised]; (b) learning implicit motion cues and subsequently using them for mask decoding [cheng2022implicit]; (c) the proposed interactive prompting paradigm, which handles motion cues explicitly using a pre-trained optical flow model and simultaneously conducts optical flow estimation and camouflaged object segmentation. The fire/snowflake symbols denote that most of the model parameters are learnable/frozen in the proposed scheme.
  • Figure 2: Overall architecture of the proposed EMIP, which consists of two separate streams: an explicit motion modeling stream (upper) and an object segmentation stream (lower). We use GMFlow [xu2022gmflow] as the fundamental model to handle motion cues. Through the camouflage feeder and motion collector, segmentation and motion prompts are injected into each task-specific stream to compensate for missing information. The fire/snowflake symbols indicate that the model parameters in the marked part or block are learnable/frozen.
  • Figure 3: Overview of our long-term modeling scheme (EMIP$^\dag$). EMIP$^\dag$ consists of a frozen EMIP and five other learnable modules (i.e., Memory Encoder, Query Encoder, STM, Motion Collector, and NCD).
  • Figure 4: Visual comparisons of our models (EMIP and EMIP$^\dag$) with eight state-of-the-art methods. We select some difficult scenarios, including dusky night, fast-moving objects, stationary objects, small objects, and noisy backgrounds.
  • Figure 5: Visual comparisons on consecutive video frames.
  • ...and 10 more figures