Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning

Ruolin Shen, Xiaozhong Ji, Kai WU, Jiangning Zhang, Yijun He, HaiHua Yang, Xiaobin Hu, Xiaoyu Sun

TL;DR

This work tackles the gap between human camouflaged perception and current multimodal vision-language systems by introducing Visual Refocus Reinforcement Fine-Tuning (VRRF). VRRF trains a visual refocus policy through a GRPO-based framework augmented with exploration-aware in-context demonstrations and a curriculum of progressively harder rewards, producing emergent hierarchies of attention (focus, rethink, backtracing) to localize concealed content. The approach yields significant improvements on camouflaged object perception and detection across COD benchmarks, outperforming supervised fine-tuning and, in hard cases, surpassing human performance in user studies. The findings demonstrate a pathway to cognitively inspired multimodal systems that can align with, and in challenging scenarios exceed, human camouflage perception, with implications for both beneficial applications and ethical considerations in surveillance contexts.

Abstract

Current multi-modal models exhibit a notable misalignment with the human visual system when identifying objects that are visually assimilated into the background. Our observations reveal that these multi-modal models cannot distinguish concealed objects, demonstrating an inability to emulate the human cognitive processes that exploit foreground-background similarity for visual analysis. To analyze this hidden discrepancy between human and model visual thinking, we build a visual system that mimics human camouflaged perception to progressively and iteratively 'refocus' on visually concealed content. Refocus is a progressive guidance mechanism that enables models to logically localize objects in images through stepwise reasoning. Localizing concealed objects requires hierarchical attention shifting, with dynamic adjustment and refinement of prior cognitive knowledge. In this paper, we propose a visual refocus reinforcement framework, based on a policy optimization algorithm, that encourages multi-modal models to think and refocus more before answering, and that achieves reasoning abilities that align with and even surpass human camouflaged perception. Our extensive experiments on camouflaged perception demonstrate the emergence of refocus visual phenomena, characterized by multiple reasoning tokens and dynamic adjustment of the detection box. In addition, experimental results on both camouflaged object classification and detection tasks show significantly superior performance compared to Supervised Fine-Tuning (SFT) baselines.
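The TL;DR names GRPO as the policy optimization algorithm. As a rough illustration of the group-relative idea, the sketch below scores several sampled box predictions for one image and normalizes their rewards within the group. The plain IoU reward, the function names `box_iou` and `group_relative_advantages`, the use of the population standard deviation, and the example boxes are all assumptions made for illustration; the paper's actual reward design (including its curriculum of progressively harder rewards) is not specified in this section.

```python
from statistics import mean, pstdev


def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Assumption: population standard deviation; implementations differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of box predictions sampled for a single image.
gt_box = (40, 40, 80, 80)
sampled_boxes = [
    (0, 0, 100, 100),      # whole image, no refocusing
    (30, 30, 90, 90),      # partially refocused
    (40, 40, 80, 80),      # exact localization
    (200, 200, 220, 220),  # misses the object entirely
]

# Assumed reward: IoU with the ground-truth box (illustrative only).
rewards = [box_iou(b, gt_box) for b in sampled_boxes]
advantages = group_relative_advantages(rewards)

for box, r, adv in zip(sampled_boxes, rewards, advantages):
    print(box, round(r, 3), round(adv, 3))
```

Because advantages are computed relative to the other samples for the same image, a prediction is rewarded for being better than its siblings rather than for crossing a fixed threshold, which is one reason this style of objective does not need a separate value model.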

Paper Structure

This paper contains 21 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: An intriguing limitation of SOTA multi-modal models: they struggle to replicate the human cognitive processes that leverage foreground-background similarity relationships for visual analysis. Mimicking human camouflaged reasoning perception, our Visual Refocus Reinforcement Fine-Tuning visual system progressively and logically ‘refocuses’ on visually concealed content.
  • Figure 2: Overview of Visual Refocus Reinforcement Fine-Tuning.
  • Figure 3: Prompt example used for in-context reinforcement learning. The <explore> block provides a multi-stage visual reasoning trajectory that mimics human perceptual shifts in attention.
  • Figure 4: Examples from our hard-concealed object set. Can you find them? Best viewed in color and zoomed-in.
  • Figure 5: Illustration of the "Visual Refocus" representation pattern. The first three rows show 'focus' as a global-to-local zoom-in; the fourth row shows 'backtracing', a local-to-global extension after perceiving the discriminative head and wing parts; and the fifth row shows 'rethink', which refines and adjusts the box to detect other objects of the same kind.
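
The three patterns named in the Figure 5 caption can be made concrete with a small geometric heuristic over a sequence of detection boxes. This is only a sketch for reading such trajectories, not the paper's method: the containment-based rules, the function names `contains`, `label_step`, and `label_trajectory`, and the example boxes are all hypothetical.

```python
def contains(outer, inner):
    """True if box `outer` fully encloses box `inner`; boxes are (x1, y1, x2, y2)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def label_step(prev_box, next_box):
    """Label one box update with an assumed, purely geometric rule."""
    if prev_box == next_box:
        return "unchanged"
    if contains(prev_box, next_box):
        return "focus"        # global-to-local zoom-in
    if contains(next_box, prev_box):
        return "backtracing"  # local-to-global extension
    return "rethink"          # box shifted or adjusted elsewhere


def label_trajectory(boxes):
    """Label every consecutive pair of boxes in a reasoning trajectory."""
    return [label_step(a, b) for a, b in zip(boxes, boxes[1:])]


# Hypothetical trajectory: zoom in twice, widen again, then shift the box.
trajectory = [
    (0, 0, 100, 100),
    (20, 20, 80, 80),
    (40, 40, 60, 60),
    (30, 30, 90, 90),
    (50, 10, 95, 60),
]
labels = label_trajectory(trajectory)
print(labels)
```

A labeler like this could be used to count how often each pattern appears in model outputs, but whether it matches the paper's own definition of the patterns is an open assumption.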