Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models

Avanika Narayan, Dan Biderman, Sabri Eyuboglu, Avner May, Scott Linderman, James Zou, Christopher Re

TL;DR

This work investigates cost-efficient collaboration between a small on-device LM and a frontier cloud LM for long-document reasoning across finance, medicine, and science. It introduces two protocols, Minion and MinionS, that reduce cloud inference cost while preserving quality: Minion cuts remote costs by 30.4x but recovers only 87% of the frontier model's performance, whereas MinionS cuts costs by 5.7x on average while recovering 97.9%. The authors analyze design choices, including model sizes, parallelization strategies, and multi-round communication, and report the resulting cost-accuracy trade-offs on multiple benchmarks. The findings point to practical ways of deploying edge-enabled reasoning systems and suggest that improvements in local models will shift more of the workload toward on-device computation, with implications for privacy and latency.

Abstract

We investigate an emerging setup in which a small, on-device language model (LM) with access to local data communicates with a frontier, cloud-hosted LM to solve real-world tasks involving financial, medical, and scientific reasoning over long documents. Can a local-remote collaboration reduce cloud inference costs while preserving quality? First, we consider a naive collaboration protocol, Minion, where the local and remote models simply chat back and forth. Because only the local model reads the full context, this protocol achieves a 30.4x reduction in remote costs, but recovers only 87% of the performance of the frontier model. We identify two key limitations of this protocol: the local model struggles to (1) follow the remote model's multi-step instructions and (2) reason over long contexts. Motivated by these observations, we study an extension of this protocol, coined MinionS, in which the remote model decomposes the task into easier subtasks over shorter chunks of the document, which are executed locally in parallel. MinionS reduces costs by 5.7x on average while recovering 97.9% of the performance of the remote model alone. Our analysis reveals several key design choices that influence the trade-off between cost and performance in local-remote systems.
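The decompose-then-execute pattern described for MinionS can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed-size character chunking, and the toy keyword-matching "local model" are all assumptions made for illustration, and the remote model's decomposition step is represented simply by a list of subtask strings passed in by the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chunk_document(text: str, chunk_chars: int) -> List[str]:
    """Split the document into fixed-size character chunks (an assumed chunking rule)."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def minions_round(
    document: str,
    subtasks: List[str],
    local_lm: Callable[[str, str], str],
    aggregate: Callable[[List[str]], str],
    chunk_chars: int = 1000,
) -> str:
    """Run every (subtask, chunk) job locally in parallel, then aggregate the outputs."""
    chunks = chunk_document(document, chunk_chars)
    jobs = [(task, chunk) for task in subtasks for chunk in chunks]
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda job: local_lm(*job), jobs))
    # Only the short, non-empty job outputs are passed on; the full document
    # never reaches the aggregation step, which stands in for the remote model.
    kept = [out for out in outputs if out]
    return aggregate(kept)


def toy_local_lm(task: str, chunk: str) -> str:
    """Hypothetical stand-in for the local LM: report a keyword hit or abstain."""
    return f"{task}: found" if task in chunk else ""


def toy_aggregate(outputs: List[str]) -> str:
    """Hypothetical stand-in for the remote LM: merge the distinct job outputs."""
    return "; ".join(sorted(set(outputs)))
```

In this sketch, jobs that find nothing return an empty string and are filtered out before aggregation, which is one simple way to keep the text sent to the remote model short. A real system would replace both toy functions with calls to actual language models.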

Paper Structure

This paper contains 74 sections, 1 theorem, 19 equations, 9 figures, 7 tables.

Key Result

Proposition C.1

Assume $n_{out}^l \cdot pcks = an$, for some $a < 1$, and that $F_{r,l}$, $d_{r,l}$, and $L_{r,l}$ are all as defined in Appendix app:edgecost_models. In this case, we can show that the ratio of total latency of Minion$\mathcal{S}$ vs. the remote-only protocol is upper-bounded by the following expre

Figures (9)

  • Figure 1: Local-Remote Systems. Minion and Minion$\mathcal{S}$ protocols. (Left) Problem set-up: local and remote LM collaborate on a data-intensive reasoning task. (Center) Minion: A simple communication protocol in which the local and remote models have an "unconstrained" back and forth chat. (Right) Minion$\mathcal{S}$: an extension of Minion where the remote LM decomposes a query into many jobs that are processed in parallel by the local model. Each job is a single-step instruction over a chunk of the context.
  • Figure 2: Cost-Accuracy Tradeoff in Edge-Remote Systems. Macro-average accuracy ($y$-axis) vs. cost ($x$-axis) across FinanceBench [islam2023financebench], LongHealth [adams2024longhealth], and QASPER [dasigi2021dataset]. Accuracy represents the fraction of correct predictions, while cost is the average USD per query based on GPT-4o rates (Jan 2025: \$2.50/1M input tokens, \$10.00/1M output tokens); see \ref{['sec:prelim']}. The table compares Minion (\ref{['sec:naive']}) and Minion$\mathcal{S}$ (\ref{['sec:methods']}) protocols against local-only and remote-only baselines. Points, colored by local model, use GPT-4o as the remote model. Exact metrics in Table \ref{['table:main-tradeoff']}.
  • Figure 3: Analysis of small LM limitations. Evaluation of Llama-3.2-3B on simple extraction tasks (see Section \ref{['app:naive-analysis']}). (Left) Performance drops significantly as context length increases. (Right) Increasing sub-task complexity reduces performance, with fewer sub-tasks yielding better results.
  • Figure 4: Trade-offs in edge model performance, communication efficiency, and cost of sequential communication. (Left) Accuracy vs. local model size, with the purple dashed line showing the GPT-4o model baseline. (Right) Communication efficiency of Minion$\mathcal{S}$ with different local LMs, where larger models (7–8B) are more token-efficient.
  • Figure 5: Scaling parallel jobs on-device improves quality. The x-axis represents tokens processed by the remote model, and the y-axis shows macro-average accuracy across LongHealth and QASPER. The cloud model is GPT-4o. Each plot varies a different Minion$\mathcal{S}$ hyperparameter affecting parallelism, with annotated values. (Left) Varying the number of unique instructions. (Middle) Varying the number of unique samples. (Right) Varying the chunking granularity in code $\mathbf{f}$. See Section \ref{['sec:methods']} for details.
  • ...and 4 more figures
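
The cost metric in the Figure 2 caption (USD per query at the quoted GPT-4o rates) can be sketched as a small calculation. The rates come from the caption; the function names and the token counts in the usage note are hypothetical and are not figures from the paper.

```python
# GPT-4o rates quoted in the Figure 2 caption (Jan 2025), in USD per 1M tokens.
INPUT_USD_PER_M = 2.50
OUTPUT_USD_PER_M = 10.00


def remote_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one query to the remote model, given its token counts."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


def cost_reduction(baseline_usd: float, protocol_usd: float) -> float:
    """How many times cheaper a protocol is than the remote-only baseline."""
    return baseline_usd / protocol_usd
```

For example, under these assumptions a remote-only query that sends 100,000 input tokens and receives 1,000 output tokens costs `remote_cost_usd(100_000, 1_000)`, about 0.26 USD; a protocol whose remote model sees fewer tokens scores a proportionally higher `cost_reduction`.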

Theorems & Definitions (2)

  • Proposition C.1
  • proof