No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son, Sung-Eui Yoon

TL;DR

This work proposes MoFit, a caption-free membership inference attack (MIA) framework that constructs synthetic conditioning inputs explicitly overfitted to the target model's generative manifold. MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.

Abstract

Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, and prior methods become ineffective when the missing captions are substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model's unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.
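
The data flow of the two stages can be illustrated with a small toy sketch. This is not the authors' implementation. The linear stand-in for the denoiser (`A`, `B`), the single fixed noise level, the zero vector used as the "no condition" embedding, the plain gradient-descent updates, and the function names `fit_surrogate` and `fit_embedding` are all assumptions made for illustration. In particular, fitting the embedding by minimizing the conditional loss on the surrogate is only one plausible reading of "derived from the surrogate".

```python
import random

random.seed(0)
D, E = 6, 3              # toy latent and embedding sizes (assumed)
ALPHA, SIGMA = 0.8, 0.6  # one fixed noise level, ALPHA**2 + SIGMA**2 == 1

# Hypothetical linear stand-in for the denoiser: eps_hat = A @ x_t + B @ c
A = [[random.gauss(0, 0.3) for _ in range(D)] for _ in range(D)]
B = [[random.gauss(0, 0.3) for _ in range(E)] for _ in range(D)]
NULL = [0.0] * E         # assumed "no condition" embedding


def residual(x0, c, eps):
    """Return eps_hat - eps for the toy denoiser under condition c."""
    x_t = [ALPHA * x + SIGMA * e for x, e in zip(x0, eps)]
    return [
        sum(A[i][j] * x_t[j] for j in range(D))
        + sum(B[i][k] * c[k] for k in range(E))
        - eps[i]
        for i in range(D)
    ]


def loss(x0, c, eps):
    """Squared noise-prediction error (L_uncond when c is NULL)."""
    return sum(r * r for r in residual(x0, c, eps))


def fit_surrogate(x0, eps, steps=300, lr=0.05):
    """Stage (i): optimize a perturbation delta that lowers the unconditional loss."""
    delta = [0.0] * D
    for _ in range(steps):
        r = residual([x + d for x, d in zip(x0, delta)], NULL, eps)
        for j in range(D):
            # d/d delta_j of ||r||^2 = 2 * ALPHA * (A^T r)_j
            delta[j] -= lr * 2 * ALPHA * sum(A[i][j] * r[i] for i in range(D))
    return delta


def fit_embedding(surrogate, eps, steps=300, lr=0.05):
    """Stage (ii), as assumed here: fit an embedding phi to the surrogate."""
    phi = [0.0] * E
    for _ in range(steps):
        r = residual(surrogate, phi, eps)
        for k in range(E):
            # d/d phi_k of ||r||^2 = 2 * (B^T r)_k
            phi[k] -= lr * 2 * sum(B[i][k] * r[i] for i in range(D))
    return phi


x0 = [random.gauss(0, 1) for _ in range(D)]   # toy "query image"
eps = [random.gauss(0, 1) for _ in range(D)]  # one fixed noise draw
delta_star = fit_surrogate(x0, eps)
surrogate = [x + d for x, d in zip(x0, delta_star)]
phi_star = fit_embedding(surrogate, eps)

# The fitted embedding is then used as a (mismatched) condition for the query.
l_uncond = loss(x0, NULL, eps)
l_cond = loss(x0, phi_star, eps)
print(f"L_uncond(x0) = {l_uncond:.4f}, L_cond(x0, phi*) = {l_cond:.4f}")
```

The toy model has no training set, so it cannot reproduce any member versus hold-out separation. It only shows the order of operations: fit the perturbation under the unconditional loss, fit the embedding on the resulting surrogate, then evaluate the original query under that embedding.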

Paper Structure (42 sections, 10 equations, 13 figures, 17 tables)

Figures (13)

  • Figure 1: Distribution of membership scores under different condition types: (a) ground-truth captions, (b) VLM-generated captions, and (c) our model-fitted embeddings. In (d), $\mathcal{L}_{\text{cond}}$ values of member samples increase when the condition is replaced with VLM-generated captions, whereas hold-out samples remain relatively stable in (e). Dotted lines denote $\mathcal{L}_{\text{uncond}}$, and all distributions are estimated using Gaussian kernel density estimation.
  • Figure 2: Overview of our proposed method. (a) Given a query image $x_0$, we first optimize a perturbation $\delta$ to overfit to the learned representation from the model. (b) From the resulting surrogate image $x_0 + \delta^*$, we extract a model-fitted embedding $\phi^*$, which is then used as a synthetic condition to amplify the disparity between member and hold-out samples in (c).
  • Figure 3: (a) Membership score distributions when conditioned on the model-fitted embedding $\phi^*$. (b) $\mathcal{L}_{\text{cond}}$ and $\mathcal{L}_{\text{uncond}}$ for member and hold-out samples under varying conditions.
  • Figure 4: Distributions of (left) $\mathcal{L}_{\text{uncond}}$ and (right) $\mathcal{L}_{\text{cond}}$ for member and hold-out pairs from (a) the Pokemon dataset and (b) the model-fitted pairs of MoFit.
  • Figure 5: ASR and AUC for different initial step sizes and timesteps $t$.
  • ...and 8 more figures