What Lurks Within? Concept Auditing for Shared Diffusion Models at Scale

Xiaoyong Yuan, Xiaolong Ma, Linke Guo, Lan Zhang

TL;DR

This work addresses the need for practical pre-deployment auditing of fine-tuned diffusion models by introducing PAIA, a model-centric framework that avoids brittle prompt-based probes and costly image-based detectors. By analyzing late-stage denoising behavior (prompt-agnostic) and using a calibrated, image-free metric that compares fine-tuned and base models (conditional calibrated error), PAIA achieves high accuracy and significant efficiency gains on both controlled and real-world models. It is robust to adaptive attacks and generalizes to rare concepts, offering a scalable approach for safer diffusion-model sharing on public hubs. The methodology contributes to safer, more transparent deployment of generative models and lays groundwork for broader auditing and safety tooling in diffusion-based AI systems.

Abstract

Diffusion models (DMs) have revolutionized text-to-image generation, enabling the creation of highly realistic and customized images from text prompts. With the rise of parameter-efficient fine-tuning (PEFT) techniques, users can now customize powerful pre-trained models using minimal computational resources. However, the widespread sharing of fine-tuned DMs on open platforms raises growing ethical and legal concerns, as these models may inadvertently or deliberately generate sensitive or unauthorized content. Despite increasing regulatory attention on generative AI, there are currently no practical tools for systematically auditing these models before deployment. In this paper, we address the problem of concept auditing: determining whether a fine-tuned DM has learned to generate a specific target concept. Existing approaches typically rely on prompt-based input crafting and output-based image classification, but they suffer from critical limitations, including prompt uncertainty, concept drift, and poor scalability. To overcome these challenges, we introduce Prompt-Agnostic Image-Free Auditing (PAIA), a novel, model-centric concept auditing framework. By treating the DM as the object of inspection, PAIA enables direct analysis of internal model behavior, bypassing the need for optimized prompts or generated images. We evaluate PAIA on 320 controlled models trained with curated concept datasets and 771 real-world community models sourced from a public DM sharing platform. Evaluation results show that PAIA achieves over 90% detection accuracy while reducing auditing time by 18-40× compared to existing baselines. To our knowledge, PAIA is the first scalable and practical solution for pre-deployment concept auditing of diffusion models, providing a practical foundation for safer and more transparent diffusion model sharing.

Paper Structure

This paper contains 47 sections, 3 theorems, 11 equations, 18 figures, 11 tables, 1 algorithm.

Key Result

Lemma 1

The gradient of the cross-attention function with respect to the text embedding $\mathbf{p}$ is expressed in terms of the following quantities: $S$, the cross-attention map (Eq. \ref{eq:attention_map}); $S_i$, the $i$-th row of $S$; $X$, the image features calculated by the previous layers; and $Y_i$, the $i$-th output of the cross-attention layer (Eq. \ref{eq:attention_output}).
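The quantity $\mathrm{diag}(S)-SS^T$ reported in Figure 2 has the same form as the Jacobian of a softmax, which is where such a term typically arises when differentiating through an attention map. The following minimal sketch checks that standard identity numerically for a single attention row. It is not the lemma's formula: the logits are hypothetical values chosen for illustration, and the helper names are not from the paper.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(s):
    """Analytic Jacobian of softmax at output s: diag(s) - s s^T."""
    n = len(s)
    return [[(s[i] if i == j else 0.0) - s[i] * s[j] for j in range(n)]
            for i in range(n)]

def numeric_jacobian(z, eps=1e-6):
    """Central finite differences: J[i][j] = d softmax_i / d z_j."""
    n = len(z)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        s_plus, s_minus = softmax(z_plus), softmax(z_minus)
        for i in range(n):
            jac[i][j] = (s_plus[i] - s_minus[i]) / (2 * eps)
    return jac

def frobenius_norm(mat):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in mat for v in row))

# Hypothetical attention logits for one query position (illustration only).
logits = [2.0, 0.5, -1.0, 0.0]
s = softmax(logits)
analytic = softmax_jacobian(s)
numeric = numeric_jacobian(logits)
n = len(logits)
max_gap = max(abs(analytic[i][j] - numeric[i][j])
              for i in range(n) for j in range(n))

# A row that puts almost all mass on one token has a near-zero Jacobian.
peaked_norm = frobenius_norm(softmax_jacobian(softmax([50.0, 0.0, 0.0, 0.0])))
```

Here `max_gap` measures the disagreement between the analytic and finite-difference Jacobians, and `peaked_norm` illustrates a general property of softmax: when a row is nearly one-hot, $\|\mathrm{diag}(S_i)-S_iS_i^T\|$ shrinks toward zero, so changes to the logits barely move the attention weights.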

Figures (18)

  • Figure 1: Overview of PAIA: Model-Centric Concept Auditing for Fine-Tuned Diffusion Models. Community users increasingly fine-tune and share DMs on platforms such as Civitai \cite{civitai_website}, HuggingFace \cite{huggingface_website}, and SeaArt \cite{seaart_website}, introducing risks of unauthorized or sensitive content generation. Existing approaches rely on observable behaviors (prompts and outputs) that are inherently unstable, easily manipulated, and costly to evaluate. We conduct a systematic study and propose PAIA, a model-centric framework that bypasses these limitations by auditing internal model behavior directly.
  • Figure 2: Impact of prompts on cross-attention. We report the attention-map norms $\|S\|$ and $\|\mathrm{diag}(S)-SS^T\|$ with respect to the first token [BOS]. The results are averaged over all celebrity and cartoon DMs, detailed in Section \ref{sec:exp_setting}.
  • Figure 3: Calibration error (CE) across denoising time steps. We compare the difference in denoising error between the fine-tuned model and its base model for target vs. irrelevant concepts. Early stages (large $t$) offer stronger separation but are more prompt-sensitive. This motivates our conditional calibrated error (CCE), which preserves early-stage signals while reducing prompt influence.
  • Figure 4: Overview of PAIA Pipeline.
  • Figure 5: Auditing performance of PAIA with different prompt generation strategies.
  • ...and 13 more figures
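The calibrated-error comparison described in the Figure 3 caption (the difference in denoising error between a fine-tuned model and its base model, for target vs. irrelevant concepts) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the nearest-prototype "noise predictors" stand in for real diffusion networks, the 2-D vectors stand in for whatever inputs the audit uses, the sign convention (fine-tuned error minus base error) is a guess, and the conditional variant (CCE) is not reproduced because this summary does not define it. Only the forward-noising rule $x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon$ is the standard DDPM formulation.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Standard DDPM forward step: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
           for a, e in zip(x0, eps)]
    return x_t, eps

def make_toy_predictor(prototypes):
    """Hypothetical noise predictor that 'knows' a fixed set of concept prototypes.

    It guesses x0 as the prototype nearest to x_t / sqrt(ab), then solves the
    forward equation for the noise.
    """
    def predict(x_t, alpha_bar):
        scaled = [v / math.sqrt(alpha_bar) for v in x_t]
        x0_hat = min(prototypes,
                     key=lambda p: sum((a - b) ** 2 for a, b in zip(scaled, p)))
        return [(v - math.sqrt(alpha_bar) * p) / math.sqrt(1.0 - alpha_bar)
                for v, p in zip(x_t, x0_hat)]
    return predict

def denoising_error(predict, samples, alpha_bar, rng, n_draws=200):
    """Mean squared error between the true and predicted noise."""
    total, count = 0.0, 0
    for x0 in samples:
        for _ in range(n_draws):
            x_t, eps = add_noise(x0, alpha_bar, rng)
            eps_hat = predict(x_t, alpha_bar)
            total += sum((a - b) ** 2 for a, b in zip(eps, eps_hat))
            count += 1
    return total / count

def calibrated_error(finetuned, base, samples, alpha_bar, seed=0):
    """Assumed convention: fine-tuned error minus base error.

    Both models are scored with the same seed, so they see identical noise draws.
    """
    err_ft = denoising_error(finetuned, samples, alpha_bar, random.Random(seed))
    err_base = denoising_error(base, samples, alpha_bar, random.Random(seed))
    return err_ft - err_base

# Toy setup: the base model knows only the origin; fine-tuning adds one concept.
base = make_toy_predictor([[0.0, 0.0]])
finetuned = make_toy_predictor([[0.0, 0.0], [4.0, 0.0]])

ce_target = calibrated_error(finetuned, base, [[4.0, 0.0]], alpha_bar=0.9)
ce_irrelevant = calibrated_error(finetuned, base, [[0.0, 4.0]], alpha_bar=0.9)
```

In this toy setup `ce_target` is strongly negative, because the fine-tuned predictor denoises the learned concept far better than the base, while `ce_irrelevant` stays at zero, because both predictors behave identically on a concept neither was tuned on. Subtracting the base model's error is what removes the component of the error that depends only on how hard the input is to denoise.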

Theorems & Definitions (5)

  • Lemma 1
  • Theorem 1
  • Remark 1
  • Remark 2
  • Lemma 2