Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images

Ali Naseh, Katherine Thai, Mohit Iyyer, Amir Houmansadr

TL;DR

This work exposes a vulnerability in AI-driven image marketplaces by showing that multimodal LLMs can reproduce AI-generated and natural images at a fraction of the market price. It introduces a three-component attack (a fine-tuned CLIP model for keyword extraction, a multi-label MLP classifier for modifiers, and GPT-4V for prompt generation), augmented by an iterative prompt refinement cycle. A large Midjourney dataset of approximately 19 million prompt-image pairs supports training and evaluation, and automated metrics and human judgments indicate that the attack can produce comparable images at roughly $0.23 to $0.27 per image. The results underscore security and economic implications for digital imagery, and the authors plan to release the dataset publicly to support future research on defenses.

Abstract

With the digital imagery landscape rapidly evolving, image stocks and AI-generated image marketplaces have become central to visual media. Traditional stock images now exist alongside innovative platforms that trade in prompts for AI-generated visuals, driven by sophisticated APIs like DALL-E 3 and Midjourney. This paper studies the possibility of employing multi-modal models with enhanced visual understanding to mimic the outputs of these platforms, introducing an original attack strategy. Our method leverages fine-tuned CLIP models, a multi-label classifier, and the descriptive capabilities of GPT-4V to create prompts that generate images similar to those available in marketplaces and from premium stock image providers, yet at a markedly lower expense. In presenting this strategy, we aim to spotlight a new class of economic and security considerations within the realm of digital imagery. Our findings, supported by both automated metrics and human assessment, reveal that comparable visual content can be produced for a fraction of the prevailing market prices ($0.23 - $0.27 per image), emphasizing the need for awareness and strategic discussions about the integrity of digital media in an increasingly AI-integrated landscape. Our work also contributes to the field by assembling a dataset consisting of approximately 19 million prompt-image pairs generated by the popular Midjourney platform, which we plan to release publicly.
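
The iterative prompt refinement cycle mentioned in the TL;DR can be pictured as a loop that writes a prompt, generates an image, scores it against the target, and feeds the result back into the next round. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function names, the feedback string, the number of rounds, and the stopping threshold are all hypothetical, and the three callables stand in for GPT-4V, a text-to-image API, and an image-similarity metric. To keep it runnable without any external service, the toy stand-ins treat an "image" as a plain string of tags.

```python
from typing import Callable, List, Tuple


def refine_prompt(
    target: str,
    keywords: List[str],
    modifiers: List[str],
    write_prompt: Callable[[str, List[str], List[str], str], str],
    generate: Callable[[str], str],
    similarity: Callable[[str, str], float],
    rounds: int = 3,          # hypothetical budget, not from the paper
    threshold: float = 0.9,   # hypothetical stopping criterion
) -> Tuple[str, float]:
    """Return the best prompt found and its similarity to the target."""
    best_prompt, best_score = "", float("-inf")
    feedback = ""  # empty on the first round
    for _ in range(rounds):
        prompt = write_prompt(target, keywords, modifiers, feedback)
        candidate = generate(prompt)
        score = similarity(target, candidate)
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= threshold:
            break
        # Assumed feedback format: tell the prompt writer what was tried.
        feedback = f"previous prompt: {prompt!r}; similarity: {score:.2f}"
    return best_prompt, best_score


# Toy stand-ins (assumptions for illustration only).
def toy_write_prompt(target, keywords, modifiers, feedback):
    # First round uses keywords only; later rounds also add modifiers.
    words = keywords if not feedback else keywords + modifiers
    return " ".join(words)


def toy_generate(prompt):
    # The toy "image" is just the prompt text itself.
    return prompt


def toy_similarity(a, b):
    # Jaccard overlap between the two tag sets.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


target_image = "castle fog cinematic 8k"
prompt, score = refine_prompt(
    target_image,
    keywords=["castle", "fog"],
    modifiers=["cinematic", "8k"],
    write_prompt=toy_write_prompt,
    generate=toy_generate,
    similarity=toy_similarity,
)
print(prompt, score)
```

In a real setting each round costs one prompt-writing call and one image generation, so the round budget directly bounds the per-image cost that the abstract reports.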

Paper Structure

This paper contains 45 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Overview of our attack.
  • Figure 2: Examples illustrating scenarios where the baseline models, BLIP2 and CLIP Interrogator, fail to accurately extract relevant keywords and modifiers from given prompts.
  • Figure 3: Overview of the multi-label classifier's performance for different values of $k$.
  • Figure 4: Comparison of the original image with those generated by our method, BLIP2, and CLIP Interrogator for three different settings.
  • Figure 5: Comparison of natural images from two different Text-to-Image APIs with those generated by our method, BLIP2, and CLIP Interrogator.
  • ...and 3 more figures