A Distributed Framework for Privacy-Enhanced Vision Transformers on the Edge

Zihao Ding, Mufeng Zhu, Zhongze Tang, Sheng Wei, Yao Liu

TL;DR

The paper tackles privacy risks in edge-to-cloud vision by distributing Vision Transformer computations across multiple non-colluding cloud servers, while keeping the global attention and final embedding on a trusted edge device. It repurposes ViT window-based attention into a partitioned offloading framework and demonstrates PED-SAM, an adaptation of SAM, achieving near-baseline segmentation while substantially reducing the risk of content reconstruction. Implemented with PyTorch, Docker, and gRPC, the approach shows notable latency reductions for constrained edge devices and robust resistance to reconstruction and object-detection attacks under various partitioning schemes. Limitations include fixed partitioning and potential risks with video frames, with future work targeting adaptive partitioning and temporal privacy enhancements.

Abstract

Nowadays, visual intelligence tools have become ubiquitous, offering all kinds of convenience and possibilities. However, these tools have high computational requirements that exceed the capabilities of resource-constrained mobile and wearable devices. While offloading visual data to the cloud is a common solution, it introduces significant privacy vulnerabilities during transmission and server-side computation. We propose a novel distributed, hierarchical offloading framework for Vision Transformers (ViTs) that addresses these privacy challenges by design. Our approach uses a local trusted edge device, such as a mobile phone or an Nvidia Jetson, as the edge orchestrator. This orchestrator partitions the user's visual data into smaller portions and distributes them across multiple independent cloud servers. By design, no single external server possesses the complete image, preventing comprehensive data reconstruction. The final data merging and aggregation computation occurs exclusively on the user's trusted edge device. We apply our framework to the Segment Anything Model (SAM) as a practical case study, which demonstrates that our method substantially enhances content privacy over traditional cloud-based approaches. Evaluations show our framework maintains near-baseline segmentation performance while substantially reducing the risk of content reconstruction and user data exposure. Our framework provides a scalable, privacy-preserving solution for vision tasks in the edge-cloud continuum.
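The partition-and-merge step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `partition` and `merge` are hypothetical, the image is a single-channel nested list rather than a tensor, and the per-window encoder that each cloud server would run is left out, so `merge` simply reassembles the raw windows on the trusted edge device.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]            # H x W grid of pixel values (one channel)
Window = Tuple[int, int, Image]    # (grid row, grid column, pixel block)


def partition(image: Image, w: int) -> List[Window]:
    """Split an image into w * w non-overlapping windows.

    Assumption: height and width are divisible by w. A real pipeline
    would pad or resize the image first.
    """
    height, width = len(image), len(image[0])
    if height % w or width % w:
        raise ValueError("image height and width must be divisible by w")
    win_h, win_w = height // w, width // w
    windows: List[Window] = []
    for r in range(w):
        rows = image[r * win_h:(r + 1) * win_h]
        for c in range(w):
            block = [row[c * win_w:(c + 1) * win_w] for row in rows]
            windows.append((r, c, block))
    return windows


def merge(windows: List[Window], w: int) -> Image:
    """Reassemble windows by grid position (runs on the trusted edge only)."""
    grid: Dict[Tuple[int, int], Image] = {(r, c): b for r, c, b in windows}
    merged: Image = []
    for r in range(w):
        for y in range(len(grid[(r, 0)])):
            row: List[int] = []
            for c in range(w):
                row.extend(grid[(r, c)][y])
            merged.append(row)
    return merged
```

With `w = 5`, `partition` yields 25 windows, each of which could be sent to a different server. Because only the edge device holds the `(row, column)` layout of every window, it is the only party that can put the pieces back together.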

Paper Structure

This paper contains 17 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Top: Typical cloud-based workflow for a visual intelligence task: the entire image is offloaded to a cloud server, which generates the image embedding. The local device then uses the embedding, via lightweight computation, for downstream vision tasks. Bottom: Our proposed privacy-enhanced solution: the image is partitioned into $w\times w=w^2$ windows (e.g., 25 shown in the figure). The content of each window is processed separately by a different external privacy domain (e.g., a cloud service provider) to extract per-window embeddings. These window embeddings are then merged and further processed at a local trusted edge device to obtain the final image embedding. The extracted image embeddings are combined with a prompt encoder and a mask decoder for image segmentation tasks. The four images on the right show example outputs for two use cases: (i) mask generation for a query point; and (ii) automatic mask generation (a more difficult task). The results show that the image embedding generated by our privacy-enhanced solution produces the same single-mask output as the original approach, and performs the more difficult automatic mask generation task well with only a slight performance decrease.
  • Figure 2: Existing ViT-like models use a combination of global attention and window attention blocks for extracting an image embedding for downstream vision tasks.
  • Figure 3: Our proposed hierarchical offloading framework enhances privacy in visual intelligence applications by using a local trusted edge orchestrator (e.g., an Nvidia Jetson) to partition data from a thin edge client and offload the computation across external cloud servers operated by different administrative domains. This design ensures that no single external party processes the complete data.
  • Figure 4: (a) In the original Segment Anything Model (SAM-H), 32 attention layers are divided into 4 groups. Each group includes 7 window attention layers followed by 1 global attention layer. Typically, the full image encoder (in the blue-shaded box) executes at the cloud server to extract the image embedding from an input image. (b) In our privacy-enhanced solution, an image is partitioned into 25 windows. The content of each window is processed by a separate third-party, external cloud provider (thus 25 parties overall), which extracts features local to the window via 28 window attention layers (shown in the blue-shaded box). The outputs are then transmitted to the edge device, where they are merged and further processed by 4 global attention layers (shown in the green-shaded box). In both figures, the multi-layer perceptron (MLP) operations at the end of each attention layer and the 2D convolution operations at the end of the image encoder are omitted.
  • Figure 5: Visualization of images reconstructed by ViTMAE [he2022masked] and Adobe Firefly [adobefirefly] using the $2\times 2$ and $4\times 3$ partition schemes. Figure 5(a) shows the original image. In each triplet in Figure 5(b) and Figure 5(c), we present the masked image with the window partition highlighted (left), i.e., the visual data shared with the cloud server, the ViTMAE reconstruction (middle), and the Adobe Firefly reconstruction (right).
  • ...and 2 more figures
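
The layer counts in the Figure 4 caption can be checked with a short sketch. It only models the order and placement of attention layers as strings. The function names `attention_schedule` and `split_for_offloading` are hypothetical, and nothing here reflects how the authors reorder or retrain the actual model.

```python
from typing import List, Tuple


def attention_schedule(groups: int = 4, window_per_group: int = 7) -> List[str]:
    """Layer order of the original SAM-H encoder per the Figure 4 caption:
    each group has 7 window attention layers followed by 1 global layer."""
    schedule: List[str] = []
    for _ in range(groups):
        schedule.extend(["window"] * window_per_group)
        schedule.append("global")
    return schedule


def split_for_offloading(schedule: List[str]) -> Tuple[List[str], List[str]]:
    """Assign layers by type, as in Figure 4(b): window attention layers run
    on the external cloud servers, global attention layers run on the edge."""
    cloud = [kind for kind in schedule if kind == "window"]
    edge = [kind for kind in schedule if kind == "global"]
    return cloud, edge


schedule = attention_schedule()
cloud_layers, edge_layers = split_for_offloading(schedule)
global_positions = [i for i, kind in enumerate(schedule) if kind == "global"]
print(len(schedule), len(cloud_layers), len(edge_layers), global_positions)
```

The split matters for privacy because window attention only mixes information inside one window, so a server holding a single window can run all 28 such layers without seeing any other part of the image. Global attention mixes information across windows, which is why those 4 layers stay on the trusted edge device.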