Mamba-based Spatio-Frequency Motion Perception for Video Camouflaged Object Detection

Xin Li, Keren Fu, Qijun Zhao

TL;DR

Vcamba is a Mamba-based framework that fuses spatio-temporal and frequency-temporal motion cues for video camouflaged object detection (VCOD). It improves frequency learning with a frequency-domain sequential scanning (FSS) strategy and uses two long-range motion modules, SLMP and FLMP, to model dynamics in the spatial and frequency domains respectively. A fusion module (SFMF) unifies the dual-domain representations, giving state-of-the-art results on MoCA-MASK and CAD at lower computation cost. The approach highlights the value of phase-based frequency motion and structured long-range modeling for breaking camouflage in video. Overall, Vcamba shows that combining frequency information with Vision Mamba yields accurate and efficient camouflaged object detection in challenging video scenes.

Abstract

Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearance for motion perception. However, the high foreground-background similarity in VCOD limits the discriminability of such features (e.g., color and texture). Recent studies demonstrate that frequency features can not only compensate for appearance limitations but also perceive motion through dynamic variations in spectral energy. Meanwhile, the emerging state space model Mamba enables efficient motion perception in frame sequences thanks to its linear-time long-sequence modeling capability. Motivated by this, we propose Vcamba, a visual camouflage Mamba based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. Specifically, by analyzing the spatial representations of frequency components, we reveal a structural evolution pattern that emerges from the ordered superposition of components. Based on this observation, we propose a frequency-domain sequential scanning (FSS) strategy to unfold the spectrum. Using FSS, the adaptive frequency enhancement (AFE) module employs Mamba to model the causal dependencies within sequences, enabling effective frequency learning. Furthermore, we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences. Finally, the space and frequency motion fusion (SFMF) module integrates dual-domain features into a unified motion representation. Experiments show that Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets with lower computation cost, confirming its superiority. Code is available at: https://github.com/BoydeLi/Vcamba.
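To make the idea of frequency-domain motion cues concrete, the following is a minimal NumPy sketch, not code from the Vcamba repository. It splits a frame into amplitude and phase spectra and compares two frames that differ by a one-pixel circular shift. The function names (`amplitude_phase`, `recombine`) and the random 8x8 toy frame are assumptions made for illustration. By the Fourier shift theorem, a pure global translation leaves the amplitude unchanged and appears only as a phase change; real video, where an object moves against a background, generally changes both.

```python
import numpy as np


def amplitude_phase(frame):
    """Return the amplitude and phase spectra of a 2-D grayscale frame."""
    spectrum = np.fft.fft2(frame)
    return np.abs(spectrum), np.angle(spectrum)


def recombine(amplitude, phase):
    """Rebuild the spatial frame from its amplitude and phase spectra."""
    spectrum = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(spectrum).real


# Toy data (hypothetical): a random frame and a copy shifted right by one pixel.
rng = np.random.default_rng(0)
frame_a = rng.random((8, 8))
frame_b = np.roll(frame_a, shift=1, axis=1)

amp_a, phase_a = amplitude_phase(frame_a)
amp_b, phase_b = amplitude_phase(frame_b)

# Wrapped phase difference between the two frames, in (-pi, pi].
phase_shift = np.angle(np.exp(1j * (phase_b - phase_a)))

print("max amplitude difference:", np.max(np.abs(amp_b - amp_a)))
print("phase shift at horizontal frequency 1:", phase_shift[0, 1])
```

For this toy shift, the phase difference at horizontal frequency index 1 is -2π/8, which follows directly from the shift theorem for an 8-pixel-wide frame.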

Paper Structure

This paper contains 18 sections, 12 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Visualization of frequency phase and amplitude representations in VCOD (top). The input is a camouflaged RGB image from a VCOD dataset. Visual comparison of predictions with and without frequency information (bottom). For the model without frequency, we remove the whole frequency branch and keep only the spatial modeling.
  • Figure 2: Diagram of the selective scan (SS2D) module, including the details of cross scan and S6 Block.
  • Figure 3: Overview of the proposed Vcamba. $I_t$ denotes the input, a concatenated windowed sequence of $N$ frames; $P_i$ represents the hierarchical predictions of our model; and $G_t$ denotes the ground truth corresponding to each input frame. Each VSS Decoder Layer consists of one VSSBlock.
  • Figure 4: Diagram of the adaptive frequency component enhancement (AFE) module and the frequency-domain sequential scanning (FSS) strategy.
  • Figure 5: Diagram of the structural evolution pattern emerging from the ordered superposition of frequency components. The low-to-high and high-to-low sequences are constructed by superposing components unfolded through the FSS. We provide a schematic illustration of the superposition process for each sequence, along with their corresponding spatial features.
  • ...and 7 more figures
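
The ordered superposition described in the Figure 5 caption can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's FSS implementation: the spectrum is split into concentric radial bands (an assumed ordering; the actual FSS scan order is not specified here), and the bands are accumulated either low-to-high or high-to-low, with an inverse FFT after each step to view the evolving spatial structure. The names `radial_band_masks` and `cumulative_superposition` and the choice of four bands are hypothetical.

```python
import numpy as np


def radial_band_masks(height, width, num_bands):
    """Split the 2-D spectrum into concentric bands, ordered low to high."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fy**2 + fx**2)
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    masks = []
    for i in range(num_bands):
        if i == num_bands - 1:
            upper = radius <= edges[i + 1]  # last band keeps the outer edge
        else:
            upper = radius < edges[i + 1]
        masks.append((radius >= edges[i]) & upper)
    return masks


def cumulative_superposition(frame, num_bands=4, low_to_high=True):
    """Return spatial images after adding one frequency band at a time."""
    spectrum = np.fft.fft2(frame)
    masks = radial_band_masks(frame.shape[0], frame.shape[1], num_bands)
    if not low_to_high:
        masks = masks[::-1]
    partial = np.zeros_like(spectrum)
    steps = []
    for mask in masks:
        partial = partial + spectrum * mask
        steps.append(np.fft.ifft2(partial).real)
    return steps


# Toy frame (hypothetical data) to exercise both scan directions.
rng = np.random.default_rng(0)
frame = rng.random((8, 8))
low_first = cumulative_superposition(frame, num_bands=4, low_to_high=True)
high_first = cumulative_superposition(frame, num_bands=4, low_to_high=False)

print("steps per sequence:", len(low_first))
print("final reconstruction error:", np.max(np.abs(low_first[-1] - frame)))
```

In the low-to-high sequence the first step holds only the lowest band (including the mean intensity), and detail accumulates with each added band. The high-to-low sequence starts from fine structure instead. Both sequences end at the original frame once every band has been added.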