Crayon: Customized On-Device LLM via Instant Adapter Blending and Edge-Server Hybrid Inference

Jihwan Bang, Juntae Lee, Kyuhong Shim, Seunghan Yang, Simyung Chang

TL;DR

Crayon introduces an on-device LLM customization method that instantly blends a pool of base LoRA adapters to tailor a model to user-defined tasks, eliminating on-device training costs. It couples this with a device-server hybrid inference strategy that routes difficult or out-of-scope queries to a larger server LLM while preserving privacy by exchanging only similarity signals rather than data. The approach is evaluated on a novel on-device customization benchmark spanning multiple QA domains and MMLU subjects, showing that Crayon outperforms strong baselines on-device and offers competitive gains when combined with a modest amount of server routing. The work demonstrates a practical path to privacy-preserving, flexible on-device customization with scalable server-assisted augmentation, and provides a benchmark to guide future research in this area.

Abstract

The customization of large language models (LLMs) for user-specified tasks is becoming increasingly important. However, maintaining all the customized LLMs on cloud servers incurs substantial memory and computational overhead, and uploading user data can also raise privacy concerns. On-device LLMs offer a promising solution by mitigating these issues. Yet the performance of on-device LLMs is inherently constrained by the limitations of small-scale models. To overcome these restrictions, we first propose Crayon, a novel approach for on-device LLM customization. Crayon begins by constructing a pool of diverse base adapters, and then instantly blends them into a customized adapter without extra training. In addition, we develop a device-server hybrid inference strategy, which allocates more demanding queries or non-customized tasks to a larger, more capable LLM on a server. This ensures optimal performance without sacrificing the benefits of on-device customization. We carefully craft a novel benchmark from multiple question-answer datasets, and show the efficacy of our method in LLM customization.
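The two ideas in the abstract, blending base adapters without training and routing harder queries to a server, can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not give the blending rule or the routing criterion, so the softmax-over-cosine-similarity weighting, the flat parameter lists standing in for LoRA weights, the per-adapter "key" embeddings, and the fixed threshold are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores):
    """Numerically stable softmax, used here to turn similarities into blend weights."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend_adapters(adapters, weights):
    """Weighted sum of adapters, each given as a flat list of parameters.

    Assumption: a real LoRA adapter has low-rank matrices per layer; a flat
    list is only a stand-in to show that blending needs no gradient steps.
    """
    size = len(adapters[0])
    return [sum(w * a[i] for w, a in zip(weights, adapters)) for i in range(size)]


def route(similarity, threshold):
    """Hypothetical routing rule: keep the query on-device only when it is
    similar enough to the customized task, otherwise send it to the server."""
    return "device" if similarity >= threshold else "server"


# Toy data (all values made up for illustration).
task_embedding = [1.0, 0.0]
adapter_keys = [[1.0, 0.0], [0.0, 1.0]]      # one assumed key per base adapter
base_adapters = [[2.0, 2.0], [0.0, 0.0]]     # flat stand-ins for LoRA weights

similarities = [cosine(task_embedding, key) for key in adapter_keys]
weights = softmax(similarities)
customized_adapter = blend_adapters(base_adapters, weights)
```

In this toy setup the first base adapter receives the larger weight because its key matches the task embedding, and the customized adapter is obtained with a single weighted sum. Raising the threshold passed to `route` sends a larger share of queries to the server, which is one simple way to think about a routing ratio.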

Paper Structure (20 sections, 7 equations, 6 figures, 4 tables, 2 algorithms)

This paper contains 20 sections, 7 equations, 6 figures, 4 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overall framework of the proposed method. For on-device LLM customization without on-device training cost or privacy issues, we devise Crayon, which instantly generates a suitable adapter from an adapter pool; it consists of preparing the adapter pool and deploying a customized adapter. We also develop device-server hybrid inference to efficiently leverage a better-generalized LLM on the server.
  • Figure 2: Example prompt input in our method. Unlike a few-shot prompt, this work uses neither an instruction nor QA examples.
  • Figure 3: Distribution plot of $\alpha$ for each training task on four base LoRAs. In (a), (c), and (d), the 0th, 24th, and 28th base LoRAs show a preference for the SIQA, MCQA, and OBQA tasks, respectively. In (b), the 11th base LoRA is trained on all three tasks evenly.
  • Figure 4: Device-server hybrid inference with varying routing ratio. Acc (%) on (a) customized tasks and (b) a mix of customized and out-of-customized tasks.
  • Figure 5: Acc (%) according to (a) the size of the customized dataset and (b) LoRA rank.
  • ...and 1 more figure