Mamba-based Spatio-Frequency Motion Perception for Video Camouflaged Object Detection
Xin Li, Keren Fu, Qijun Zhao
TL;DR
Vcamba is a Mamba-based framework that fuses spatio-temporal and frequency-temporal motion cues for video camouflaged object detection (VCOD). It improves frequency learning through a frequency-domain sequential scanning strategy and uses two long-range motion perception modules (SLMP and FLMP) to model dynamics in the spatial and phase domains. A fusion module (SFMF) unifies the dual-domain representations, yielding state-of-the-art results on MoCA-MASK and CAD at lower computational cost. The approach highlights the value of phase-based frequency motion and structured long-range modeling for breaking camouflage in video. Overall, Vcamba shows that combining frequency information with Vision Mamba gives accurate and efficient camouflaged object detection in challenging video scenes.
Abstract
Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearance for motion perception. However, the high foreground-background similarity in VCOD limits the discriminability of appearance features (e.g., color and texture). Recent studies demonstrate that frequency features can not only compensate for appearance limitations, but also perceive motion through dynamic variations in spectral energy. Meanwhile, the emerging state space model Mamba enables efficient motion perception in frame sequences through its linear-time long-sequence modeling capability. Motivated by this, we propose Vcamba, a visual camouflage Mamba based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. Specifically, by analyzing the spatial representations of frequency components, we reveal a structural evolution pattern that emerges from the ordered superposition of components. Based on this observation, we propose a frequency-domain sequential scanning (FSS) strategy to unfold the spectrum. Using FSS, the adaptive frequency enhancement (AFE) module employs Mamba to model the causal dependencies within the resulting sequences, enabling effective frequency learning. Furthermore, we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences, respectively. Finally, the space and frequency motion fusion (SFMF) module integrates dual-domain features into a unified motion representation. Experiments show that Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets at lower computational cost. Code is available at: https://github.com/BoydeLi/Vcamba.
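The idea of unfolding a spectrum into an ordered sequence, and of structure emerging as components are superposed in that order, can be illustrated with a small NumPy sketch. This is not the authors' FSS implementation: the abstract does not state the scan order, so the low-to-high radial-frequency ordering used here is an assumption, and the function names (`low_to_high_order`, `unfold_spectrum`, `progressive_reconstructions`) are hypothetical.

```python
import numpy as np

def low_to_high_order(h, w):
    """Flat indices of an (h, w) FFT grid, sorted by radial frequency.

    Assumption: a low-to-high scan; the paper's actual FSS order may differ.
    """
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies, cycles/pixel
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies, cycles/pixel
    radius = np.sqrt(fy ** 2 + fx ** 2)
    return np.argsort(radius.ravel(), kind="stable")

def unfold_spectrum(image):
    """Unfold the 2D spectrum of a grayscale image into a 1D sequence."""
    spectrum = np.fft.fft2(image)
    order = low_to_high_order(*image.shape)
    return spectrum.ravel()[order], order

def progressive_reconstructions(image, steps):
    """Reconstruct the image from growing prefixes of the unfolded sequence."""
    seq, order = unfold_spectrum(image)
    h, w = image.shape
    recons = []
    for k in np.linspace(1, seq.size, steps, dtype=int):
        partial = np.zeros(h * w, dtype=complex)
        partial[order[:k]] = seq[:k]  # keep only the first k components
        recons.append(np.fft.ifft2(partial.reshape(h, w)).real)
    return recons

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((32, 32))      # stand-in for a grayscale video frame
    for r in progressive_reconstructions(frame, steps=5):
        print(float(np.linalg.norm(frame - r)))
```

Under this assumed ordering, the first element of the sequence is the DC component, and each longer prefix adds finer detail to the reconstruction. A sequence model such as Mamba could then consume the unfolded coefficients as tokens, though how Vcamba tokenizes them is not specified in the abstract.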
