Prada: Black-Box LLM Adaptation with Private Data on Resource-Constrained Devices

Ziyao Wang, Yexiao He, Zheyu Shen, Yu Li, Guoheng Sun, Myungjin Lee, Ang Li

TL;DR

Prada addresses the challenge of adapting large language models to domain-specific private data on resource-constrained edge devices while preserving both data and model privacy. It achieves this with a two-stage approach: offline fine-tuning of a lightweight proxy LLM using LoRA on-device, and online offset-based adaptation that refines a remote black-box LLM's outputs via logits differences between the adapted and base proxy models, augmented by speculative decoding to reduce latency. The method yields strong adaptation performance comparable to centralized fine-tuning, while significantly reducing memory, communication, and latency overhead, enabling practical edge deployments. This work highlights a viable path for privacy-preserving, efficient LLM customization in privacy-sensitive, bandwidth-limited environments and outlines directions for further improvements in prompt privacy and proxy-distillation strategies.

Abstract

In recent years, Large Language Models (LLMs) have demonstrated remarkable abilities in various natural language processing tasks. However, adapting these models to specialized domains using private datasets stored on resource-constrained edge devices, such as smartphones and personal computers, remains challenging due to significant privacy concerns and limited computational resources. Existing model adaptation methods either compromise data privacy by requiring data transmission or jeopardize model privacy by exposing proprietary LLM parameters. To address these challenges, we propose Prada, a novel privacy-preserving and efficient black-box LLM adaptation system using private on-device datasets. Prada employs a lightweight proxy model fine-tuned with Low-Rank Adaptation (LoRA) locally on user devices. During inference, Prada leverages the logits offset, i.e., the difference in outputs between the base and adapted proxy models, to iteratively refine outputs from a remote black-box LLM. This offset-based adaptation approach preserves both data privacy and model privacy, as there is no need to share sensitive data or proprietary model parameters. Furthermore, we incorporate speculative decoding to speed up Prada's inference, making the system practical to deploy on bandwidth-constrained edge devices. Extensive experiments on various downstream tasks demonstrate that Prada achieves performance comparable to centralized fine-tuning methods while reducing computational overhead by up to 60% and communication costs by up to 80%.
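The offset idea in the abstract can be illustrated with a minimal sketch. It assumes the offset is added element-wise to the black-box model's next-token logits and that all three models share one vocabulary; the abstract does not spell out these details. The function names and the toy numbers below are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_offset(blackbox_logits, proxy_base_logits, proxy_adapted_logits):
    """Shift black-box logits by (adapted proxy - base proxy).

    Assumption: a simple additive offset per vocabulary entry.
    """
    if not (len(blackbox_logits) == len(proxy_base_logits) == len(proxy_adapted_logits)):
        raise ValueError("all logit vectors must share one vocabulary")
    return [
        b + (a - p)
        for b, p, a in zip(blackbox_logits, proxy_base_logits, proxy_adapted_logits)
    ]


# Toy three-token vocabulary with made-up logits for one decoding step.
vocab = ["yes", "no", "maybe"]
blackbox = [2.0, 1.0, 0.5]        # hypothetical output of the remote black-box LLM
proxy_base = [1.0, 1.0, 1.0]      # hypothetical on-device proxy before fine-tuning
proxy_adapted = [0.0, 3.0, 1.0]   # hypothetical on-device proxy after LoRA fine-tuning

adjusted = apply_offset(blackbox, proxy_base, proxy_adapted)
probs = softmax(adjusted)
next_token = vocab[max(range(len(vocab)), key=adjusted.__getitem__)]
```

In this toy step the black-box model alone would pick "yes", while the offset learned by the proxy moves the choice to "no". Only logits cross the network in such a scheme; the private fine-tuning data and the LoRA weights stay on the device.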

Paper Structure

This paper contains 27 sections, 8 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: The relationship between Llama model parameter size and peak GPU memory usage during LoRA fine-tuning with rank=128. The red dashed line represents the GPU memory of the Jetson AGX device used in our experiments. Models above this threshold are difficult to fine-tune directly on-device. Prada selects proxy models smaller than 7B, enabling on-device adaptation for black-box models as large as 30B.
  • Figure 2: The overview of Prada. Prada begins with the client offline fine-tuning the proxy model using LoRA. Subsequently, the client engages in online black-box offset adaptation through interactions with the server.
  • Figure 3: The offset adaptation of Prada.
  • Figure 4: Comparison of the single-GPU peak memory usage (top row) and total training time (bottom row) of Prada and baselines on the MMLU training dataset during the SFT stage. The memory usage of Llama-30B and Qwen-34B is estimated.
  • Figure 5: The communication costs of baselines and Prada. The y-axis shows the total communication cost (MB), which comprises three components: data transfer, model transfer, and inference. The data-transfer cost is computed on the Code-Instruction-120K training dataset, and the inference cost is computed over 1.5K queries.
  • ...and 1 more figure