Efficient Fine-Tuning with Domain Adaptation for Privacy-Preserving Vision Transformer

Teru Nagamori, Sayaka Shiota, Hitoshi Kiya

TL;DR

This work tackles privacy-preserving image classification with Vision Transformers by allowing models to train and infer on encrypted images without substantial accuracy loss. It introduces a domain-adaptation-based fine-tuning method that aligns the ViT embedding process to encrypted inputs via key-dependent adaptations, enabling effective use of block-wise encryption and pixel shuffling. The method demonstrates near-baseline accuracy on CIFAR-10, CIFAR-100, and Imagenette datasets and improves training efficiency compared with encryption-alone training, outperforming existing privacy-preserving approaches. This supports practical deployment of ViT-based systems in untrusted cloud settings where data privacy is critical.
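To make the two encryption steps named above concrete, here is a minimal sketch of key-based block scrambling followed by pixel shuffling, written with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function name `encrypt_image`, the use of one integer `key` to seed both permutations, and the choice of a block size equal to the ViT patch size (16) are all hypothetical.

```python
import numpy as np

def encrypt_image(img, block=16, key=0):
    """Illustrative block-wise encryption (assumed scheme, not the paper's code).

    1. Block scrambling: permute the positions of the blocks.
    2. Pixel shuffling: permute the values inside every block,
       using the same permutation for all blocks.
    """
    h, w, c = img.shape
    if h % block or w % block:
        raise ValueError("image sides must be multiples of the block size")
    rng = np.random.default_rng(key)  # the secret key seeds both permutations
    nb_h, nb_w = h // block, w // block

    # Split into blocks: (H, W, C) -> (num_blocks, block * block * C)
    blocks = (img.reshape(nb_h, block, nb_w, block, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(nb_h * nb_w, block * block * c))

    block_perm = rng.permutation(nb_h * nb_w)        # block scrambling
    pixel_perm = rng.permutation(block * block * c)  # pixel shuffling
    blocks = blocks[block_perm][:, pixel_perm]

    # Reassemble the blocks into an image of the original shape
    return (blocks.reshape(nb_h, nb_w, block, block, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(h, w, c))
```

With `block` set to the patch size, each step only reorders whole patches or the entries within a patch, which is the kind of structure a key-dependent adaptation of the ViT embedding could compensate for.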

Abstract

We propose a novel method for privacy-preserving deep neural networks (DNNs) with the Vision Transformer (ViT). The method allows us not only to train and test models with visually protected images but also to avoid the performance degradation caused by the use of encrypted images, whereas conventional methods cannot avoid the influence of image encryption. A domain adaptation method is used to efficiently fine-tune ViT with encrypted images. In experiments, the method is demonstrated to outperform conventional methods in an image classification task on the CIFAR-10 and ImageNet datasets in terms of classification accuracy.

Paper Structure

This paper contains 14 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Architecture of Vision Transformer ($N$=9)
  • Figure 2: Overview of proposed fine-tuning
  • Figure 3: Procedure of image encryption
  • Figure 4: Example of encrypted images
  • Figure 5: Test procedure
  • ...and 1 more figures