When Privacy Meets Recovery: The Overlooked Half of Surrogate-Driven Privacy Preservation for MLLM Editing

Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong

TL;DR

This work tackles the overlooked problem of recovering surrogate-driven edits in privacy-preserving MLLM workflows. It introduces SPPE, a comprehensive dataset for evaluating edit fidelity under surrogate-based privacy, and SOER, a DiT-based multimodal framework that reconstructs MLLM-edited outputs on original content while preserving privacy. The approach integrates semantic, visual, and spatial guidance with region-weighted losses, achieving superior edit fidelity and privacy preservation on SPPE and InstructPix2Pix. The results demonstrate robust generalization across diverse content and editing tasks, offering a practical path for privacy-aware MLLM applications.
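As an illustration of what a region-weighted loss can look like, the following is a minimal sketch, not the loss used by SOER (this summary does not give its exact form). It assumes a binary mask marking the sensitive region and a hypothetical scalar `region_weight` that up-weights errors inside that region; the function name and default value are likewise illustrative.

```python
import numpy as np

def region_weighted_mse(pred, target, mask, region_weight=5.0):
    """Weighted MSE: pixels with mask == 1 count `region_weight` times as much.

    `region_weight` and this exact weighting scheme are assumptions made
    for illustration; they are not taken from the paper.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    if mask.ndim == pred.ndim - 1:
        mask = mask[..., None]  # add a channel axis so the mask broadcasts
    # Weight is 1 outside the region and `region_weight` inside it.
    weights = np.broadcast_to(1.0 + (region_weight - 1.0) * mask, pred.shape)
    sq_err = (pred - target) ** 2
    # Normalize by the total weight so the loss stays on the MSE scale.
    return float((weights * sq_err).sum() / weights.sum())

# Toy 2x2 example: the same unit error costs more inside the region.
pred = np.zeros((2, 2))
mask = np.array([[1.0, 0.0], [0.0, 0.0]])
err_inside = np.array([[1.0, 0.0], [0.0, 0.0]])
err_outside = np.array([[0.0, 0.0], [0.0, 1.0]])
loss_inside = region_weighted_mse(pred, err_inside, mask)
loss_outside = region_weighted_mse(pred, err_outside, mask)
```

With these toy inputs, an error inside the masked region is penalized five times as heavily as the same error outside it, which is the general intent of weighting a loss by region.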

Abstract

Privacy leakage in Multimodal Large Language Models (MLLMs) has long been an intractable problem. Existing studies, though they effectively obscure private information in MLLMs, often overlook the evaluation of the authenticity and recovery quality of user privacy. To this end, this work focuses on the critical challenge of how to restore surrogate-driven protected data in diverse MLLM scenarios. We first bridge this research gap by contributing the SPPE (Surrogate Privacy Protected Editable) dataset, which includes a wide range of privacy categories and user instructions to simulate real MLLM applications. This dataset offers protected surrogates alongside their various MLLM-edited versions, thus enabling the direct assessment of privacy recovery quality. By formulating privacy recovery as a guided generation task conditioned on complementary multimodal signals, we further introduce a unified approach that reliably reconstructs private content while preserving the fidelity of MLLM-generated edits. Experiments on both SPPE and InstructPix2Pix show that our approach generalizes well across diverse visual content and editing tasks, achieving a strong balance between privacy protection and MLLM usability.

Paper Structure

This paper contains 26 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Demonstration of our Edit-Compatible Surrogate-Driven Privacy Protection paradigm. Sensitive regions in the original image are locally replaced with synthetic content to create a surrogate, which is sent to the cloud for editing by MLLMs. The surrogate's edits are locally combined with the original image to produce a privacy-preserving output that faithfully reflects MLLM-intended modifications.
  • Figure 2: The sensitive category (C) is “license plate,” and the edit prompt is “Turn to penciled style.” The original image $I$ contains the private content “SUBARU,” which is replaced by a synthetic one, “908ABD,” in the surrogate image $S$. However, the MLLM-edited surrogate output $S'$ retains the synthetic plate “908ABD” rather than reflecting the original content, necessitating recovery of the surrogate output to better approximate the edited original image $I'$.
  • Figure 3: Performance of surrogate generation. This example shows an image containing a sensitive student ID card region (leftmost panel). On the right, the top row compares our surrogate method with traditional privacy protection techniques, demonstrating superior concealment of private content while maintaining semantic coherence. The bottom row presents surrogates generated with varying protection strengths, where a higher strength indicates less influence from the original image. We select a middle strength that achieves a flexible trade-off between privacy protection and semantic fidelity.
  • Figure 4: Distribution of privacy-related categories across three modality types (Textual, Visual, and Multimodal). The outer ring represents fine-grained categories, while the inner ring shows their grouping into higher-level modality classes.
  • Figure 5: Overview of the proposed Surrogate-to-Original Editable Recovery (SOER). SOER processes semantic, visual, restoration, and edit cues through dedicated encoders to extract rich spatial and directional information. The resulting embeddings are combined and fed into a DiT-based transformer for multimodal interaction, enabling the generation of an edited image that faithfully reflects MLLM-driven edits while remaining consistent with the original content.
  • ...and 2 more figures
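
The data flow described in the captions of Figures 1 and 2 can be sketched with toy arrays. This is a hedged illustration only: `mock_cloud_edit` is a hypothetical stand-in for the remote MLLM edit (a simple intensity inversion), and `naive_paste_back` is a deliberately naive baseline, not SOER. It shows why pasting the original pixels back is insufficient: the sensitive region stays private but does not receive the edit.

```python
import numpy as np

def make_surrogate(original, mask, synthetic):
    """Locally replace the sensitive region (mask == 1) with synthetic content."""
    m = mask[..., None].astype(original.dtype)
    return original * (1 - m) + synthetic * m

def mock_cloud_edit(image):
    """Hypothetical stand-in for the cloud MLLM edit: invert intensities."""
    return 1.0 - image

def naive_paste_back(edited_surrogate, original, mask):
    """Naive local recovery: copy unedited original pixels into the region."""
    m = mask[..., None].astype(original.dtype)
    return edited_surrogate * (1 - m) + original * m

rng = np.random.default_rng(0)
original = rng.random((8, 8, 3))    # I: contains private content
synthetic = rng.random((8, 8, 3))   # synthetic replacement content
mask = np.zeros((8, 8))
mask[2:5, 2:5] = 1                  # sensitive region

surrogate = make_surrogate(original, mask, synthetic)           # S, sent to the cloud
edited_surrogate = mock_cloud_edit(surrogate)                   # S', returned by the cloud
recovered = naive_paste_back(edited_surrogate, original, mask)  # local recovery attempt
target = mock_cloud_edit(original)  # I': reference only, needs I in the cloud

inside = mask.astype(bool)
# Outside the region the naive result matches I'; inside it is still unedited.
outside_matches = np.allclose(recovered[~inside], target[~inside])
inside_matches = np.allclose(recovered[inside], target[inside])
```

Here `outside_matches` is true while `inside_matches` is false, which mirrors the gap shown in Figure 2: the edit must be transferred onto the original content in the sensitive region, and that is the task a learned recovery model is meant to handle.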