Chain of Visual Perception: Harnessing Multimodal Large Language Models for Zero-shot Camouflaged Object Detection

Lv Tang; Peng-Tao Jiang; Zhihao Shen; Hao Zhang; Jinwei Chen; Bo Li

TL;DR

The paper addresses the challenge of camouflaged object detection (COD) in zero-shot settings, where annotated COD data and extensive retraining are unavailable. It proposes MMCPF, a promptable, two-model framework that leverages a frozen Multimodal Large Language Model (MLLM) to locate camouflaged regions and a promptable Visual Foundation Model (VFM) to generate precise masks, all without modifying model weights. A key contribution is the Chain of Visual Perception (CoVP), which enhances MLLM perception through linguistic cues (attributes, polysemy, and description diversity) and a visual completion module that refines uncertain MLLM outputs using DINOv2 semantic features to produce semantically informed prompts for the VFM. Extensive experiments on CAMO, COD10K, NC4K, MoCA-Mask, and OVCamo demonstrate that MMCPF achieves state-of-the-art zero-shot COD performance and competitive results with weakly- and fully-supervised methods, validating the potential of prompting-based perception in MLLMs for challenging vision tasks without dataset-specific training.
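The visual completion step described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the image has already been encoded into a grid of patch features (for example by DINOv2), that the MLLM's uncertain output has been mapped to grid positions, and that "completion" means adding the patches whose features are most similar to those positions. The function names `cosine` and `complete_points`, the `top_k` parameter, and the cosine-similarity selection rule are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # (row, col) index into the patch grid


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def complete_points(
    features: List[List[Sequence[float]]],
    initial_points: List[Point],
    top_k: int = 2,
) -> List[Point]:
    """Return up to top_k extra patch locations most similar to the initial points.

    Hypothetical selection rule (an assumption, not taken from the paper):
    rank every remaining patch by cosine similarity to the mean feature of
    the initial points.
    """
    if not initial_points:
        return []
    dim = len(features[0][0])
    # Prototype: mean feature over the patches the MLLM pointed at.
    proto = [
        sum(features[r][c][d] for r, c in initial_points) / len(initial_points)
        for d in range(dim)
    ]
    seen = set(initial_points)
    scored = [
        (cosine(features[r][c], proto), (r, c))
        for r in range(len(features))
        for c in range(len(features[0]))
        if (r, c) not in seen
    ]
    # Highest similarity first; ties broken by grid position for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [p for _, p in scored[:top_k]]


# Toy 2x3 grid of 2-D features standing in for real patch embeddings.
toy_features = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.05], [0.1, 1.0]],
]
P_I = [(0, 0)]                                      # initial point from the MLLM
P_C = complete_points(toy_features, P_I, top_k=2)   # completed points
```

In a full pipeline, the union of `P_I` and `P_C` would be converted back to pixel coordinates and passed as point prompts to the promptable VFM. That conversion depends on the patch size of the feature extractor and is omitted here.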

Abstract

In this paper, we introduce a novel multimodal camo-perceptive framework (MMCPF) for zero-shot Camouflaged Object Detection (COD) that leverages the capabilities of Multimodal Large Language Models (MLLMs). Current COD methods rely predominantly on supervised learning, which demands extensive, accurately annotated datasets and results in weak generalization; our zero-shot MMCPF circumvents these limitations. Although MLLMs hold significant potential for broad applications, their effectiveness in COD is limited, and they tend to misinterpret camouflaged objects. To address this challenge, we further propose a strategic enhancement called the Chain of Visual Perception (CoVP), which significantly improves the perceptual capabilities of MLLMs in camouflaged scenes by leveraging both linguistic and visual cues more effectively. We validate MMCPF on five widely used COD datasets: CAMO, COD10K, NC4K, MoCA-Mask, and OVCamo. Experiments show that MMCPF outperforms all existing state-of-the-art zero-shot COD methods and achieves competitive performance compared with weakly-supervised and fully-supervised methods, demonstrating its potential. The code is available on GitHub: https://github.com/luckybird1994/MMCPF

Paper Structure (15 sections, 1 equation, 7 figures, 5 tables)

Introduction
Related Work
Multimodal Large Language Model
Camouflaged Object Detection
Method
Framework Overview
Chain of Visual Perception
Perception Enhanced from Linguistic Aspect
Perception Enhanced from Visual Aspect
Experiments
Datasets and Evaluation Metrics
Comparison with Different Methods
Quantitative and Qualitative Evaluation
Ablation Studies
Conclusion

Figures (7)

Figure 1: Querying results generated by GPT-4V in COD. GPT-4V may answer the question incorrectly or guess wrong answers at random. The red mask is the ground truth and the green box is generated by GPT-4V.
Figure 2: Our multimodal camo-perceptive framework (MMCPF). MMCPF mainly consists of the chain of visual perception (CoVP), which enhances the perceptual abilities of the MLLM in camouflaged scenes from both the linguistic and visual aspects.
Figure 3: The second column visualizes the coordinates generated by the MLLM, which are somewhat uncertain and cannot completely locate the camouflaged object. The third column displays the coordinates generated by our visual completion mechanism. $\mathcal{P}_I$ and $\mathcal{P}_C$ are the initial and completed points, respectively.
Figure 4: Prompts with attribute, polysemy and diversity.
Figure 5: Performance improvement on COD10K when adding (2) physical attribute description, (3) dynamic attribute description, (4) polysemous description, (5) diverse description, and (6) visual completion, compared with (1) the baseline.
...and 2 more figures
