Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models

Seungeun Oh, Jinhyuk Kim, Jihong Park, Seung-Woo Ko, Tony Q. S. Quek, Seong-Lyun Kim

TL;DR

This work tackles the throughput bottleneck in hybrid language models that couple an on-device SLM with a remote LLM by introducing the Uncertainty-aware opportunistic Hybrid Language Model (U-HLM). The device measures the SLM's uncertainty using temperature perturbation, uses it to predict the LLM's rejection probability, and skips uplink transmission and LLM computation for tokens that are likely to be accepted. The approach rests on an empirically observed linear relation between uncertainty and rejection probability, together with an analytically derived uncertainty threshold and its expected rejection risk. Experiments show up to 97.54% of the LLM's inference accuracy, up to 2.54× higher token throughput than HLM without skipping, and 45.93% fewer transmissions under challenging wireless conditions. The results point to a practical path toward high-throughput hybrid on-device and offloaded LLM inference in resource-constrained wireless environments, with potential extensions to other token-level communication tasks.

Abstract

This paper studies a hybrid language model (HLM) architecture that integrates a small language model (SLM) operating on a mobile device with a large language model (LLM) hosted at the base station (BS) of a wireless network. The HLM token generation process follows the speculative inference principle: the SLM's vocabulary distribution is uploaded to the LLM, which either accepts or rejects it, with rejected tokens being resampled by the LLM. While this approach ensures alignment between the vocabulary distributions of the SLM and LLM, it suffers from low token throughput due to uplink transmission and the computation costs of running both language models. To address this, we propose a novel HLM structure coined Uncertainty-aware opportunistic HLM (U-HLM), wherein the SLM locally measures its output uncertainty and skips both uplink transmissions and LLM operations for tokens that are likely to be accepted. This opportunistic skipping is enabled by our empirical finding of a linear correlation between the SLM's uncertainty and the LLM's rejection probability. We analytically derive the uncertainty threshold and evaluate its expected risk of rejection. Simulations show that U-HLM reduces uplink transmissions and LLM computations by 45.93%, while achieving up to 97.54% of the LLM's inference accuracy and 2.54$\times$ faster token throughput than HLM without skipping.
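
The accept-or-reject step described in the abstract can be illustrated with the standard speculative-sampling rule. This is a minimal sketch, not the authors' implementation: it assumes the draft token is accepted with probability min(1, y_d / x_d) and that a rejected token is resampled from the normalized residual max(0, y - x), where x and y are the SLM and LLM vocabulary distributions. The function name `verify_draft` and its arguments are hypothetical.

```python
import random


def verify_draft(draft, slm_probs, llm_probs, rng):
    """Check one draft token with the standard speculative-sampling rule.

    slm_probs (x) and llm_probs (y) are probability lists over the same
    vocabulary; draft is the index sampled by the SLM.
    Returns (token, accepted). All names here are illustrative assumptions.
    """
    x_d, y_d = slm_probs[draft], llm_probs[draft]
    # Always accept when y_d >= x_d; otherwise accept with probability y_d / x_d.
    if y_d >= x_d or rng.random() < y_d / x_d:
        return draft, True
    # Rejected: resample from the residual distribution max(0, y - x).
    # rng.choices normalizes the weights internally.
    residual = [max(0.0, y - x) for x, y in zip(slm_probs, llm_probs)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


if __name__ == "__main__":
    rng = random.Random(0)
    x = [0.5, 0.5]  # toy SLM distribution
    y = [1.0, 0.0]  # toy LLM distribution
    print(verify_draft(0, x, y, rng))  # LLM prefers token 0, so it is accepted
    print(verify_draft(1, x, y, rng))  # LLM assigns zero mass, so it is rejected
```

In this toy case the outcome is deterministic: a draft with y_d >= x_d is always accepted, and a draft the LLM gives zero probability is always rejected and replaced by a token from the residual.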

Paper Structure

This paper contains 12 sections, 1 theorem, 14 equations, 6 figures, 1 table.

Key Result

Theorem 1

Under the i.i.d. assumption, with $u \coloneqq u(t)$ and $\beta \coloneqq \beta(t)$ for any $t$, the uncertainty threshold $u_{\text{th}}$ in U-HLM is given by: where $\Delta=P(y_d < x_d)$ represents the probability that a draft token $d$ is either probabilistically accepted or rejected. Defining $R$ as the expected rejection risk, where $f(u)$ denotes the probability density function (PDF) of un

Figures (6)

  • Figure 1: Schematic illustration of the proposed U-HLM over a wireless network consisting of a single device-server.
  • Figure 2: Detailed process of generating response tokens in U-HLM with $u_{\text{th}} = 0.5$: The input token sequence is processed by the SLM in two ways—one for generating the SLM's draft token and another for temperature perturbation. Tokens sampled from the temperature perturbation are compared with the draft token, returning 0 if they match and 1 otherwise, with the average used to compute the uncertainty $u(t)$. (a) When $u(t) \leq u_{\text{th}}$, uplink transmission and LLM operation are skipped. (b) When $u(t) > u_{\text{th}}$, the process continues with the LLM's verification and resampling.
  • Figure 3: Curves depicting the relationship between uncertainty and rejection probability for three uncertainty measures. Each curve includes a total of 5,134 data points, with the line representing the mean and the shaded area indicating the 95% confidence interval.
  • Figure 4: Cosine similarity and Transmission Rate (TR) of U-HLM as a function of uncertainty threshold.
  • Figure 5: Empirical probability density of uncertainty (left) and the linear regression curve showing the relationship between uncertainty and rejection probability (right). Dashed vertical lines indicate the two uncertainty thresholds.
  • ...and 1 more figure
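
The uncertainty measure and skipping rule described in the Figure 2 caption can be sketched as follows. This is a hedged illustration under stated assumptions: the caption does not specify how the perturbation is applied, so the code assumes one sample per perturbed temperature from a softmax over the SLM's logits. The function names and the temperature list are hypothetical, not taken from the paper.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def perturbation_uncertainty(logits, draft, temperatures, rng):
    """Average mismatch between the draft token and perturbed samples.

    Each perturbed sample scores 0 if it matches the draft and 1 otherwise;
    the mean of these scores is the uncertainty u(t).
    The set of temperatures is an assumption made for this sketch.
    """
    mismatches = 0
    for temp in temperatures:
        probs = softmax(logits, temp)
        sample = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        mismatches += int(sample != draft)
    return mismatches / len(temperatures)


def should_skip(u, u_th):
    """Skip uplink transmission and LLM verification when u(t) <= u_th."""
    return u <= u_th


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [1000.0, 0.0, 0.0]  # toy logits: the SLM is sure of token 0
    temps = [0.5, 1.0, 1.5]      # hypothetical perturbation temperatures
    u = perturbation_uncertainty(logits, draft=0, temperatures=temps, rng=rng)
    print(u, should_skip(u, u_th=0.5))
```

With a sharply peaked toy distribution every perturbed sample matches the draft, so the uncertainty is zero and the token is kept locally; a draft that disagrees with all perturbed samples has uncertainty one and is sent to the LLM.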

Theorems & Definitions (3)

  • Remark 1
  • Theorem 1
  • proof