Synthesize Privacy-Preserving High-Resolution Images via Private Textual Intermediaries
Haoxiang Wang, Zinan Lin, Da Yu, Huishuai Zhang
TL;DR
Synthesis via Private Textual Intermediaries (SPTI) is an inference-only, differentially private (DP) pipeline that shifts the privacy guarantee from high-dimensional images to the text domain: it converts private images into captions, privately evolves these captions with a modified Private Evolution algorithm, and then generates high-resolution images from the evolved text using diffusion models. The key innovation is a cross-modal voting mechanism (Image Voting) that guides text evolution by comparing the images produced from candidate texts against the private data, with the whole procedure satisfying $(\epsilon,\delta)$-DP via Gaussian noise and adaptive composition. Empirically, SPTI achieves substantially better Fréchet Inception Distance (FID) scores than DP fine-tuning and prior Private Evolution baselines on LSUN Bedroom and MM-CelebA-HQ at $\epsilon=1.0$, while remaining compatible with proprietary API backends and avoiding model training. The framework demonstrates that text can serve as a universal, privacy-preserving interface for multimodal generation, enabling high-fidelity DP synthetic images with practical resource efficiency and broad applicability, though at the cost of higher compute overhead and potential limits on domain generalization. Overall, SPTI offers a scalable, API-friendly path to private visual data sharing and downstream analysis by privatizing the text rather than the pixels.
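The Image Voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that private images and the images generated from candidate texts have already been mapped to embedding vectors, that each private image casts one vote for its nearest candidate, and that adjacency means adding or removing one private image (so the vote histogram has L2 sensitivity 1). The function names `noisy_image_votes` and `select_candidates`, the Euclidean distance, and the resampling rule are all illustrative assumptions.

```python
import numpy as np


def noisy_image_votes(private_emb, candidate_emb, sigma, rng):
    """Return a Gaussian-noised histogram of nearest-candidate votes.

    private_emb:   (n_private, d) embeddings of private images (assumed given).
    candidate_emb: (n_candidates, d) embeddings of images generated from
                   candidate texts (assumed given).
    sigma:         Gaussian noise scale; in practice it would be calibrated to
                   the target (epsilon, delta), which this sketch does not do.
    """
    # Pairwise squared Euclidean distances, shape (n_private, n_candidates).
    diffs = private_emb[:, None, :] - candidate_emb[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)

    # Each private image votes for exactly one candidate.
    nearest = np.argmin(dists, axis=1)
    votes = np.bincount(nearest, minlength=candidate_emb.shape[0]).astype(float)

    # Adding or removing one private image changes one count by 1,
    # so Gaussian noise is added per bin with that sensitivity in mind.
    return votes + rng.normal(0.0, sigma, size=votes.shape)


def select_candidates(noisy_votes, num_keep, rng):
    """Resample candidate indices in proportion to the clipped noisy votes.

    This selection rule is a hypothetical stand-in for the evolution step;
    it only post-processes the noisy histogram, so it uses no extra privacy.
    """
    weights = np.clip(noisy_votes, 0.0, None)
    total = weights.sum()
    if total <= 0.0:
        probs = np.full(weights.shape, 1.0 / weights.size)
    else:
        probs = weights / total
    return rng.choice(weights.size, size=num_keep, replace=True, p=probs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    private = rng.normal(size=(100, 8))      # placeholder private embeddings
    candidates = rng.normal(size=(10, 8))    # placeholder candidate embeddings
    hist = noisy_image_votes(private, candidates, sigma=2.0, rng=rng)
    print(select_candidates(hist, num_keep=10, rng=rng))
```

In this sketch the private data is touched only inside `noisy_image_votes`; everything downstream operates on the noised histogram, which is why the texts that survive selection can be handed to a text-to-image model without further privacy cost.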
Abstract
Generating high-fidelity, differentially private (DP) synthetic images offers a promising route to sharing and analyzing sensitive visual data without compromising individual privacy. However, existing DP image synthesis methods struggle to produce high-resolution outputs that faithfully capture the structure of the original data. In this paper, we introduce a novel method, Synthesis via Private Textual Intermediaries (SPTI), that generates high-resolution DP images and is easy to adopt. The key idea is to shift the challenge of DP image synthesis from the image domain to the text domain by leveraging state-of-the-art DP text generation methods. SPTI first summarizes each private image into a concise textual description using image-to-text models, then applies a modified Private Evolution algorithm to generate DP text, and finally reconstructs images using text-to-image models. Notably, SPTI requires no model training, only inference with off-the-shelf models. Given a private dataset, SPTI produces synthetic images of substantially higher quality than prior DP approaches. On the LSUN Bedroom dataset, SPTI attains an FID of 26.71 at $\epsilon = 1.0$, improving over Private Evolution's FID of 40.36. Similarly, on MM-CelebA-HQ, SPTI achieves an FID of 33.27 at $\epsilon = 1.0$, compared to 57.01 for DP fine-tuning baselines. Overall, our results demonstrate that SPTI provides a resource-efficient framework, compatible with proprietary models, for generating high-resolution DP synthetic images, greatly expanding access to private visual datasets.
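The three stages named in the abstract can be laid out as a short skeleton. This is only a sketch under stated assumptions: `caption_image`, `evolve_captions_dp`, and `generate_image` are hypothetical callables standing in for an image-to-text model, the modified Private Evolution step, and a text-to-image model, and none of them corresponds to a real library API.

```python
from typing import Any, Callable, List, Sequence


def spti_pipeline(
    private_images: Sequence[Any],
    caption_image: Callable[[Any], str],
    evolve_captions_dp: Callable[[List[str], Sequence[Any]], List[str]],
    generate_image: Callable[[str], Any],
) -> List[Any]:
    """Inference-only skeleton: caption, privately evolve text, then generate."""
    # Stage 1: summarize each private image as text. These captions are still
    # private and must not be released.
    private_captions = [caption_image(img) for img in private_images]

    # Stage 2: the only stage assumed to carry the DP guarantee. It receives
    # the private captions and images and returns DP text.
    dp_captions = evolve_captions_dp(private_captions, private_images)

    # Stage 3: generation reads only DP text, so it is post-processing.
    return [generate_image(text) for text in dp_captions]


if __name__ == "__main__":
    # Toy stand-ins so the skeleton runs end to end without any model.
    images = ["img_0", "img_1", "img_2"]
    synthetic = spti_pipeline(
        images,
        caption_image=lambda img: f"caption of {img}",
        evolve_captions_dp=lambda caps, imgs: ["a tidy bedroom"] * 2,
        generate_image=lambda text: f"image<{text}>",
    )
    print(synthetic)
```

Passing the models in as callables reflects the abstract's point that the method needs only inference: any off-the-shelf or API-hosted model exposing these three operations could be plugged in.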
