In-the-Wild Camouflage Attack on Vehicle Detectors through Controllable Image Editing

Xiao Fang, Yiming Gong, Stanislav Panev, Celso de Melo, Shuowen Hu, Shayok Chakraborty, Fernando De la Torre

Abstract

Deep neural networks (DNNs) have achieved remarkable success in computer vision but remain highly vulnerable to adversarial attacks. Among them, camouflage attacks manipulate an object's visible appearance to deceive detectors while remaining stealthy to humans. In this paper, we propose a new framework that formulates vehicle camouflage attacks as a conditional image-editing problem. Specifically, we explore both image-level and scene-level camouflage generation strategies, and fine-tune a ControlNet to synthesize camouflaged vehicles directly on real images. We design a unified objective that jointly enforces vehicle structural fidelity, style consistency, and adversarial effectiveness. Extensive experiments on the COCO and LINZ datasets show that our method achieves significantly stronger attack effectiveness, reducing AP50 by more than 38%, while better preserving vehicle structure and improving human-perceived stealthiness compared to existing approaches. Furthermore, our framework generalizes effectively to unseen black-box detectors and exhibits promising transferability to the physical world. The project page is available at https://humansensinglab.github.io/CtrlCamo

Paper Structure (29 sections, 10 equations, 35 figures, 9 tables)


Figures (35)

  • Figure 1: Overview. Given a real image, our pipeline stylizes the target vehicle based on either its immediate surroundings (image-level) or a visual concept present in the overall scene (scene-level), producing stealthy adversarial examples. The numbers on the bounding boxes indicate detector confidence scores, and the absence of a box indicates that the vehicle is not detected.
  • Figure 2: Overview of our pipeline. As shown in (a) and (b), the pipeline consists of a No-Box Attack stage and a White-Box Attack stage. In (a), the ControlNet is fine-tuned to stylize vehicles using a reference region while preserving geometry and background through structure, style, and background supervision ($L_\text{struct}$, $L_\text{s}$, $L_\text{b}$). In (b), the model is further optimized against a detector $\mathcal{M}_\text{det}$ by adding an adversarial loss $L_\text{adv}$ and a color-consistency loss $L_\text{c}$. Panel (c) summarizes the conditions provided to ControlNet under the image-level and scene-level settings, and (d) illustrates the style loss $L_\text{s}$, which aligns vehicle latent features with the reference area.
  • Figure 3: Qualitative comparison with other methods. The first two rows show results from the COCO dataset, and the last two rows are from the LINZ dataset. Within each dataset, the first row corresponds to the image-level strategy, and the second row corresponds to the scene-level strategy. Scene types are indicated on the left. In the "lake" scene, boats are stylized toward the water, while in the "parking lot" scene, cars are stylized toward trees. All camouflaged vehicles are composited onto the real image backgrounds.
  • Figure 4: Projector-based physical experiment. (a) Real images in digital space. (b) Photos captured from real images. (c) Reference areas used for style guidance. (d) Camouflaged images generated from the captured photos. (e) Photos taken after projecting the camouflaged images back onto the 3D physical models.
  • Figure 5: Effectiveness of background supervision.
  • ...and 30 more figures
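
The loss terms named in the Figure 2 caption can be read as a two-stage objective. The following is a minimal sketch, assuming that each stage sums its active terms with scalar weights and that the White-Box stage keeps the No-Box terms. The weights $\lambda_\ast$, the stage names used as subscripts, and the additive form are assumptions introduced here for illustration. The caption only states which losses are active in each stage, so the paper's own equations may differ.

```latex
% Sketch only: the weights \lambda_* and the weighted-sum form are assumed,
% not taken from the source.

% Stage (a), No-Box Attack: structure, style, and background supervision.
L_\text{no-box} = \lambda_\text{struct} L_\text{struct}
                + \lambda_\text{s} L_\text{s}
                + \lambda_\text{b} L_\text{b}

% Stage (b), White-Box Attack: adds the adversarial and color-consistency
% terms, computed against the detector \mathcal{M}_\text{det}.
L_\text{white-box} = L_\text{no-box}
                   + \lambda_\text{adv} L_\text{adv}
                   + \lambda_\text{c} L_\text{c}
```

Under this reading, setting $\lambda_\text{adv} = \lambda_\text{c} = 0$ recovers the No-Box stage, which matches the caption's description of (b) as an extension of (a).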