Chain of Visual Perception: Harnessing Multimodal Large Language Models for Zero-shot Camouflaged Object Detection
Lv Tang, Peng-Tao Jiang, Zhihao Shen, Hao Zhang, Jinwei Chen, Bo Li
TL;DR
The paper addresses camouflaged object detection (COD) in zero-shot settings, where annotated COD data and extensive retraining are unavailable. It proposes MMCPF, a promptable, two-model framework in which a frozen Multimodal Large Language Model (MLLM) locates camouflaged regions and a promptable Visual Foundation Model (VFM) generates precise masks, all without modifying model weights. A key contribution is the Chain of Visual Perception (CoVP), which enhances MLLM perception through linguistic cues (attributes, polysemy, and description diversity) and a visual completion module that refines uncertain MLLM outputs using DINOv2 semantic features to produce semantically informed prompts for the VFM. Extensive experiments on CAMO, COD10K, NC4K, MoCA-Mask, and OVCamo show that MMCPF achieves state-of-the-art zero-shot COD performance and results competitive with weakly- and fully-supervised methods, supporting the potential of prompting-based perception in MLLMs for challenging vision tasks without dataset-specific training.
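To make the visual completion idea concrete, the sketch below shows one plausible way a single uncertain MLLM location could be expanded into several point prompts for the VFM. It is an illustration under stated assumptions, not the paper's implementation: selecting patches by cosine similarity to the seed is assumed here, the names `cosine` and `complete_point_prompts` are hypothetical, and the two-dimensional toy vectors merely stand in for DINOv2 patch features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def complete_point_prompts(features, seed, top_k=3):
    """Expand one seed location into several point prompts.

    features: grid (list of rows) of per-patch feature vectors. In this sketch
              they are toy vectors standing in for DINOv2 patch features.
    seed:     (row, col) patch location proposed by the MLLM (assumed input).
    top_k:    number of extra prompts to add (hypothetical parameter).
    """
    seed_row, seed_col = seed
    seed_feature = features[seed_row][seed_col]

    # Score every other patch by its similarity to the seed patch.
    scored = []
    for r, row in enumerate(features):
        for c, feat in enumerate(row):
            if (r, c) != (seed_row, seed_col):
                scored.append((cosine(feat, seed_feature), (r, c)))

    # Keep the seed plus the top_k most similar patches as point prompts.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(seed_row, seed_col)] + [rc for _, rc in scored[:top_k]]


# Toy 4x4 grid: a 2x2 "object" in the centre, background everywhere else.
OBJECT, BACKGROUND = [1.0, 0.1], [0.1, 1.0]
grid = [[BACKGROUND for _ in range(4)] for _ in range(4)]
for r, c in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    grid[r][c] = OBJECT

prompts = complete_point_prompts(grid, seed=(1, 1), top_k=3)
print(prompts)
```

In this toy setup the extra prompts fall on the remaining object patches, since their features match the seed's. The resulting points would then be passed to a promptable segmentation model to produce the mask.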
Abstract
In this paper, we introduce a novel multimodal camo-perceptive framework (MMCPF) that tackles zero-shot Camouflaged Object Detection (COD) by leveraging the capabilities of Multimodal Large Language Models (MLLMs). Current COD methods rely predominantly on supervised learning, which demands large, accurately annotated datasets and generalizes poorly; the zero-shot MMCPF circumvents both limitations. Although MLLMs hold significant potential for broad applications, their effectiveness in COD is limited because they tend to misinterpret camouflaged objects. To address this challenge, we further propose the Chain of Visual Perception (CoVP), which significantly improves the perceptual capabilities of MLLMs in camouflaged scenes by making more effective use of both linguistic and visual cues. We validate MMCPF on five widely used COD datasets: CAMO, COD10K, NC4K, MoCA-Mask, and OVCamo. Experiments show that MMCPF outperforms all existing state-of-the-art zero-shot COD methods and achieves performance competitive with weakly-supervised and fully-supervised methods, demonstrating its potential. The GitHub repository for this paper is https://github.com/luckybird1994/MMCPF.
