FoCLIP: A Feature-Space Misalignment Framework for CLIP-Based Image Manipulation and Detection

Yulin Chen, Zeyuan Wang, Tianyuan Yu, Yingmei Wei, Liang Bai

TL;DR

FoCLIP exposes the vulnerability of CLIP-based metrics to cross-modal misalignment through a tripartite, gradient-based optimization framework that aligns image features with multiple target prompts while preserving visual fidelity. It introduces a Feature Alignment Loss, a Distribution Balance Loss, and a Pixel-Guard Regularization Loss, and demonstrates substantial CLIPscore improvements across artistic prompts and ImageNet subsets. The authors also exploit a grayscale sensitivity phenomenon to build a tampering detector with 91% accuracy, providing a practical defense against CLIP-based spoofing. Overall, the work highlights security risks in CLIP-based multimodal systems and offers both an effective attack framework and a complementary tampering-detection mechanism.

Abstract

The well-aligned image-text representations of CLIP-based models enable applications such as CLIPscore, a widely adopted image quality assessment metric. However, such CLIP-based metrics are vulnerable because their multimodal alignment is delicate. In this work, we propose FoCLIP, a feature-space misalignment framework for fooling CLIP-based image quality metrics. Built on stochastic gradient descent, FoCLIP integrates three key components to construct fooling examples: feature alignment, the core module, which reduces the image-text modality gap; a score distribution balance module; and pixel-guard regularization. Together, these components balance CLIPscore performance against image quality. The resulting images can be optimized to maximize CLIPscore predictions across diverse input prompts, even though human observers find them either visually unrecognizable or semantically incongruent with the corresponding adversarial prompts. Experiments on ten artistic masterpiece prompts and ImageNet subsets demonstrate that optimized images achieve significant CLIPscore improvements while preserving high visual fidelity. In addition, we find that grayscale conversion induces significant feature degradation in fooling images, producing a noticeable CLIPscore reduction while preserving statistical consistency with the original images. Inspired by this phenomenon, we propose a color channel sensitivity-driven tampering detection mechanism that achieves 91% accuracy on standard benchmarks. In conclusion, this work establishes a practical pathway for feature misalignment in CLIP-based multimodal systems, together with a corresponding defense method.
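The grayscale-sensitivity idea in the abstract can be illustrated with a minimal sketch. It assumes the detector compares a CLIPscore-like score before and after grayscale conversion and flags an image when the drop exceeds a threshold. The names `grayscale_score_drop` and `looks_tampered`, the `score_fn` callable, and the threshold value are hypothetical; the abstract does not specify the actual decision rule.

```python
def to_grayscale(pixels):
    """Convert a list of (r, g, b) tuples to gray, replicated over 3 channels.

    Uses the ITU-R BT.601 luma weights; the paper's conversion may differ.
    """
    gray = []
    for r, g, b in pixels:
        y = 0.299 * r + 0.587 * g + 0.114 * b
        gray.append((y, y, y))
    return gray


def grayscale_score_drop(pixels, score_fn):
    """Score of the color image minus the score of its grayscale version.

    score_fn is a hypothetical callable standing in for a CLIPscore
    computation against a fixed prompt.
    """
    return score_fn(pixels) - score_fn(to_grayscale(pixels))


def looks_tampered(pixels, score_fn, threshold=0.1):
    """Flag the image when grayscale conversion lowers the score by more
    than `threshold` (an assumed value, not one taken from the paper)."""
    return grayscale_score_drop(pixels, score_fn) > threshold
```

In practice `score_fn` would wrap a real CLIP model and prompt, and the threshold would be calibrated on held-out clean and fooling images.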

Paper Structure

This paper contains 19 sections, 9 equations, 9 figures.

Figures (9)

  • Figure 1: Illustration of fooling CLIPscore. The correct score is 0.32, yet the image produced by our FoCLIP method receives an unexpectedly high CLIPscore despite being visually inconsistent with the prompt.
  • Figure 2: The framework of FoCLIP, a tripartite optimization approach for adversarial CLIPscore manipulation. Built upon stochastic gradient descent (SGD) updates to the image feature vector $\mathbf{g}(x)$, this framework iteratively adjusts pixel values to bridge the modality gap between visual and textual embeddings. The architecture decomposes the adversarial process into three synergistic components: (a) Feature Alignment Loss minimizes the cosine distance between image features and target text prompts to enhance semantic alignment in CLIP's embedding space. (b) Distribution Balance Loss ensures balanced similarity scores across multiple prompts by penalizing variance, avoiding overfitting to specific concepts. (c) Pixel-Guard Regularization Loss constrains pixel values within a predefined range $[bound_{lower}, bound_{upper}]$ via ReLU limitations, preserving visual fidelity during optimization.
  • Figure 3: Heatmap of CLIPscore for famous artworks and their titles, comparing the SGD, LVE and PGD approaches of CLIPMasterPrints [FreibergeCLIPMasterPrints] with our method at 1,000 and 50,000 iterations. Our fooling examples showed the best performance.
  • Figure 4: (a) CLIPscore comparison of fooling images generated by SGD, LVE, PGD and our method across 25 target classes, alongside similarity scores of corresponding ImageNet validation images. (b) Average similarity trends across 25-100 categories show our method outperforms others significantly, with minimal score degradation as category count increases (note: some variance values are imperceptible due to scale in (b)).
  • Figure 5: To illustrate the relationship between pixel-guard regularization bounds and CLIPscore, we visualize it via a 3D graph. The x- and y-axes represent $bound_{lower}\in[-1,0]$ and $bound_{upper}\in[0,1]$, while the z-axis indicates CLIPscore. Representative fooling images are displayed at key points.
  • ...and 4 more figures
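
The three loss terms named in the Figure 2 caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: alignment is taken as the mean cosine distance to the prompt features, balance as the variance of the per-prompt similarities, and pixel-guard as a ReLU penalty on values outside the bounds. The function names and the weights `w_balance` and `w_guard` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def feature_alignment_loss(img_feat, text_feats):
    """Mean cosine distance between the image feature and each prompt."""
    return sum(1.0 - cosine(img_feat, t) for t in text_feats) / len(text_feats)


def distribution_balance_loss(img_feat, text_feats):
    """Variance of the per-prompt similarities (assumed form of the penalty)."""
    sims = [cosine(img_feat, t) for t in text_feats]
    mean = sum(sims) / len(sims)
    return sum((s - mean) ** 2 for s in sims) / len(sims)


def pixel_guard_loss(pixels, lower, upper):
    """Mean ReLU penalty on pixel values outside [lower, upper]."""
    def relu(z):
        return max(0.0, z)
    return sum(relu(lower - p) + relu(p - upper) for p in pixels) / len(pixels)


def total_loss(img_feat, text_feats, pixels, lower=-1.0, upper=1.0,
               w_balance=1.0, w_guard=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (feature_alignment_loss(img_feat, text_feats)
            + w_balance * distribution_balance_loss(img_feat, text_feats)
            + w_guard * pixel_guard_loss(pixels, lower, upper))
```

In an actual attack, `img_feat` would come from CLIP's image encoder applied to the current pixels, and an automatic-differentiation framework would propagate the gradient of the total loss back to the pixels.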