MIP: CLIP-based Image Reconstruction from PEFT Gradients
Peiheng Zhou, Ming Hu, Xiaofei Xie, Yihao Huang, Kangjie Chen, Mingsong Chen
TL;DR
This work analyzes privacy risks in CLIP-based federated learning (FL) that uses parameter-efficient fine-tuning (PEFT). It shows that gradients from small PEFT modules (soft prompts or adapters) can reveal training images, and it proposes MIP, a reconstruction attack that combines label prediction with inverse gradient estimation to overcome vanishing gradients in the multimodal gradient flow. The method recovers images on multiple datasets, with PSNR and SSIM scores indicating meaningful reconstruction, though challenges remain for large image encoders. The study highlights a significant privacy concern in multimodal PEFT-based FL and offers guidance on attack design and on potential defensive strategies involving secure aggregation and PEFT choices.
Abstract
The Contrastive Language-Image Pre-training (CLIP) model, an effective pre-trained multimodal neural network, is widely used in distributed machine learning tasks, especially Federated Learning (FL). Typically, CLIP-based FL adopts Parameter-Efficient Fine-Tuning (PEFT) for model training, which fine-tunes only adapter parameters or soft prompts rather than the full set of model parameters. Although PEFT differs from the traditional training mode, in this paper we show theoretically that the gradients of adapters or soft prompts can still be used to perform image reconstruction attacks. Based on this analysis, we propose Multum-In-Parvo (MIP), a dedicated reconstruction attack method targeting CLIP-based distributed machine learning architectures. Specifically, MIP reconstructs CLIP training images from the gradients of soft prompts or an adapter. In addition, MIP includes a label prediction strategy to accelerate convergence and an inverse gradient estimation mechanism to avoid the vanishing gradient problem on the text encoder. Experimental results show that MIP can effectively reconstruct training images from the gradients of soft prompts or adapters of CLIP models.
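To give an intuition for why the gradients of a small trainable module can leak both the label and the input, the following is a minimal sketch of the classic analytic leakage for a single linear layer with bias trained with cross-entropy on one sample. It is not MIP itself: the toy weights, the input, and the helper names (`linear_ce_gradients`, `infer_label`, `recover_input`) are all hypothetical, and attacking real CLIP PEFT gradients would presumably require iterative optimization rather than this closed-form recovery.

```python
import math


def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def linear_ce_gradients(W, b, x, label):
    """Gradients of cross-entropy w.r.t. W and b for logits = W x + b (one sample)."""
    logits = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    p = softmax(logits)
    # dL/db_i = p_i - 1[i == label]
    grad_b = [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]
    # dL/dW_ij = dL/db_i * x_j
    grad_W = [[g * xj for xj in x] for g in grad_b]
    return grad_W, grad_b


def infer_label(grad_b):
    # Only the true class has a negative bias gradient (p_i - 1 < 0).
    return min(range(len(grad_b)), key=lambda i: grad_b[i])


def recover_input(grad_W, grad_b):
    # Each row of grad_W equals grad_b[i] * x, so dividing by grad_b[i] yields x.
    # Use the row with the largest |grad_b[i]| for numerical stability.
    i = max(range(len(grad_b)), key=lambda k: abs(grad_b[k]))
    return [g / grad_b[i] for g in grad_W[i]]


# Hypothetical toy setup: 3 classes, 3 input features, one private sample.
W = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [-0.5, 0.1, 0.1]]
b = [0.0, 0.1, -0.1]
x_private = [0.7, -1.2, 0.5]
y_private = 2

# The attacker observes only the shared gradients.
grad_W, grad_b = linear_ce_gradients(W, b, x_private, y_private)
predicted_label = infer_label(grad_b)
x_recovered = recover_input(grad_W, grad_b)
```

In this toy case the label and the input are recovered exactly from the gradients alone. With batches larger than one, or with deeper frozen encoders between the image and the trainable module, the relationship is no longer closed-form, which is where optimization-based attacks come in.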
