ObCLIP: Oblivious CLoud-Device Hybrid Image Generation with Privacy Preservation
Haoqi Wu, Wei Dai, Ming Xu, Li Wang, Qiang Yan
TL;DR
ObCLIP tackles the privacy risk and high cost of cloud-based diffusion for text-to-image generation. It converts the real prompt into a set of candidate prompts that differ only in sensitive attributes, then runs a cloud-device hybrid pipeline in which the cloud performs the initial denoising steps for all candidates and the client finishes on-device. It provides a formal privacy guarantee via λ-obliviousness and introduces practical server-side accelerations (exploiting batch and temporal redundancy, plus block skipping) to bound computational overhead. Empirical results across multiple datasets and models show that ObCLIP delivers rigorous prompt privacy with utility comparable to large cloud models, while reducing server-side cost relative to naive oblivious generation and outperforming cryptographic baselines by orders of magnitude in efficiency. The work suggests a viable path toward privacy-preserving, efficient diffusion-based inference in real-world cloud services, and notes limitations and avenues for future work such as differential privacy and an image-to-image extension.
Abstract
Diffusion models have gained significant popularity due to their remarkable capabilities in image generation, albeit at the cost of intensive computational requirements. Meanwhile, despite their widespread deployment in inference services such as Midjourney, concerns have arisen about the potential leakage of sensitive information in uploaded user prompts. Existing solutions either lack rigorous privacy guarantees or fail to strike an effective balance between utility and efficiency. To bridge this gap, we propose ObCLIP, a plug-and-play safeguard that enables oblivious cloud-device hybrid generation. By oblivious, we mean that each input prompt is transformed into a set of semantically similar candidate prompts that differ only in sensitive attributes (e.g., gender, ethnicity). The cloud server processes all candidate prompts without knowing which one is real, thus preventing any prompt leakage. To mitigate server cost, only a small portion of the denoising steps is performed on the large cloud model. The intermediate latents are then sent back to the client, which selects the target latent and completes the remaining denoising using a small device model. Additionally, we analyze and incorporate several cache-based accelerations that leverage temporal and batch redundancy, effectively reducing computational cost with minimal utility degradation. Extensive experiments across multiple datasets demonstrate that ObCLIP provides rigorous privacy and utility comparable to cloud models at a slightly increased server cost.
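The data flow described in the abstract can be illustrated with a minimal toy sketch. It is not the ObCLIP implementation: the function names (make_candidates, cloud_denoise, device_finish), the prompt template, and the attribute values are all hypothetical, and the "denoisers" are simple numeric stand-ins rather than diffusion models. The sketch only shows who sees what: the server receives every candidate prompt, while the index of the real one stays on the client.

```python
import hashlib
import random
from typing import List, Tuple

DIM = 4  # toy latent size (assumption; real latents are image-shaped tensors)


def prompt_target(prompt: str) -> List[float]:
    """Toy stand-in for prompt conditioning: a deterministic vector per prompt."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]


def toy_step(latent: List[float], target: List[float], rate: float) -> List[float]:
    """Toy stand-in for one denoising step: move the latent toward the target."""
    return [x + rate * (t - x) for x, t in zip(latent, target)]


def make_candidates(template: str, real_value: str, alternatives: List[str],
                    rng: random.Random) -> Tuple[List[str], int]:
    """Client: build prompts that differ only in one sensitive attribute."""
    values = list(dict.fromkeys([real_value] + alternatives))  # dedupe, keep order
    rng.shuffle(values)  # hide the position of the real prompt
    candidates = [template.format(attr=v) for v in values]
    return candidates, values.index(real_value)  # the index never leaves the client


def cloud_denoise(candidates: List[str], cloud_steps: int) -> List[List[float]]:
    """Server: run the first few steps for every candidate, unaware which is real."""
    latents = []
    for prompt in candidates:
        latent = [0.0] * DIM  # shared toy starting latent
        target = prompt_target(prompt)
        for _ in range(cloud_steps):
            latent = toy_step(latent, target, rate=0.5)  # "large cloud model"
        latents.append(latent)
    return latents


def device_finish(latent: List[float], prompt: str, device_steps: int) -> List[float]:
    """Client: finish the remaining steps locally on the selected latent."""
    target = prompt_target(prompt)
    for _ in range(device_steps):
        latent = toy_step(latent, target, rate=0.3)  # "small device model"
    return latent


if __name__ == "__main__":
    rng = random.Random(0)
    cands, secret_idx = make_candidates(
        "a portrait of a {attr} doctor", "female", ["male", "nonbinary"], rng)
    latents = cloud_denoise(cands, cloud_steps=3)   # sent to and from the server
    final = device_finish(latents[secret_idx], cands[secret_idx], device_steps=5)
    print(len(cands), "candidates; final latent has", len(final), "values")
```

In this sketch the server's work grows linearly with the number of candidates, which is the overhead that the cache-based accelerations mentioned above are meant to reduce.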
