Efficient Fine-Tuning with Domain Adaptation for Privacy-Preserving Vision Transformer
Teru Nagamori, Sayaka Shiota, Hitoshi Kiya
TL;DR
This work tackles privacy-preserving image classification with Vision Transformers by allowing models to be trained and to run inference on encrypted images without substantial accuracy loss. It introduces a domain-adaptation-based fine-tuning method that aligns the ViT embedding process to encrypted inputs via key-dependent adaptations, enabling effective use of block-wise encryption and pixel shuffling. The method achieves near-baseline accuracy on the CIFAR-10, CIFAR-100, and Imagenette datasets, trains more efficiently than fine-tuning on encrypted images without adaptation, and outperforms existing privacy-preserving approaches. This supports practical deployment of ViT-based systems in untrusted cloud settings where data privacy is critical.
Abstract
We propose a novel method for privacy-preserving deep neural networks (DNNs) based on the Vision Transformer (ViT). The method allows models to be both trained and tested on visually protected images while avoiding the performance degradation that image encryption causes; conventional methods cannot avoid this degradation. A domain adaptation method is used to fine-tune ViT efficiently on encrypted images. In experiments on the CIFAR-10 and ImageNet datasets, the method outperforms conventional methods in image classification accuracy.
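
To make the idea of key-dependent adaptation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that encryption consists of a block (patch) permutation and a within-patch pixel permutation, both derived from one secret key, and that the ViT patch embedding is a plain linear map with a per-patch position embedding (the class token is ignored). The names `encrypt_patches`, `adapt_embedding`, `E`, and `pos` are hypothetical and chosen for illustration only. Under these assumptions, permuting the rows of the embedding matrix and the position embedding with the same key makes the tokens of an encrypted image equal to the plain-image tokens, up to a reordering.

```python
import numpy as np


def encrypt_patches(image, patch, key):
    """Illustrative block-wise encryption: scramble patch order and shuffle
    pixels inside every patch, both with permutations drawn from `key`."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    # Split the image into flattened patches: (num_patches, patch*patch*c).
    patches = (image.reshape(rows, patch, cols, patch, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(rows * cols, patch * patch * c))
    rng = np.random.default_rng(key)
    block_perm = rng.permutation(rows * cols)        # block scrambling
    pixel_perm = rng.permutation(patch * patch * c)  # pixel shuffling
    encrypted = patches[block_perm][:, pixel_perm]
    return patches, encrypted, block_perm, pixel_perm


def adapt_embedding(E, pos, block_perm, pixel_perm):
    """Assumed key-dependent adaptation: reorder the rows of the patch
    embedding matrix and the position embedding with the same permutations."""
    return E[pixel_perm], pos[block_perm]


# Toy example: 8x8 RGB image, 4x4 patches, 16-dimensional tokens.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
patches, encrypted, block_perm, pixel_perm = encrypt_patches(image, 4, key=1234)

E = rng.random((4 * 4 * 3, 16))       # hypothetical pre-trained embedding
pos = rng.random((len(patches), 16))  # hypothetical position embedding
E_adapted, pos_adapted = adapt_embedding(E, pos, block_perm, pixel_perm)

plain_tokens = patches @ E + pos
encrypted_tokens = encrypted @ E_adapted + pos_adapted

# Encrypted tokens match the plain tokens, reordered by the block permutation.
tokens_match = np.allclose(encrypted_tokens, plain_tokens[block_perm])
```

In this sketch the adapted embedding serves only as a starting point: the model would then be fine-tuned on encrypted images, and a party without the key would see only scrambled inputs.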
