Token Level Routing Inference System for Edge Devices

Jianshu She, Wenhao Zheng, Zhengzhong Liu, Hongyi Wang, Eric Xing, Huaxiu Yao, Qirong Ho

TL;DR

The paper tackles the challenge of running high-quality language inference on edge devices by introducing token-level routing between a fast on-device small language model and a cloud-based large model. It presents two complementary frameworks, CITER and Co-LLM, to learn per-token routing or deferral policies, integrated into a full system with ONNX-based on-device inference and SGLang cloud serving. Empirical results on CommonsenseQA show that routing roughly 7% of tokens to the cloud yields over 60% relative improvement in small-model accuracy while keeping cloud traffic low, demonstrating a practical path to high-quality edge inference. The work highlights the orchestration of multi-round prefilling, stateful routing metadata, and lightweight APIs to enable real-world deployments with privacy-preserving edge execution and scalable cloud assistance.

Abstract

The computational complexity of large language model (LLM) inference significantly constrains the efficiency of deploying such models on edge devices. In contrast, small language models offer faster decoding and lower resource consumption but often suffer from degraded response quality and heightened susceptibility to hallucinations. To address this trade-off, collaborative decoding, in which a large model assists in generating critical tokens, has emerged as a promising solution. This paradigm leverages the strengths of both model types by enabling high-quality inference through selective intervention of the large model, while maintaining the speed and efficiency of the smaller model. In this work, we present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation. Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with fewer than 7% of generated tokens routed to the large model in the cloud.
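The selective consultation described in the abstract can be illustrated with a minimal sketch. It assumes a simple threshold rule on a per-token confidence score; the paper's routers (CITER, Co-LLM) are learned policies, so this is only an approximation of the idea. The names `route_decode`, `cloud_fraction`, `SmallModel`, and `LargeModel` are hypothetical and do not come from the paper or from any library.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
# the small model maps a token prefix to (next_token, confidence),
# the large model maps a token prefix to its next token.
SmallModel = Callable[[List[str]], Tuple[str, float]]
LargeModel = Callable[[List[str]], str]


def route_decode(
    prompt: List[str],
    small: SmallModel,
    large: LargeModel,
    threshold: float = 0.5,
    max_new_tokens: int = 16,
    eos: str = "<eos>",
) -> Tuple[List[str], List[str]]:
    """Decode token by token, deferring low-confidence steps to the large model.

    Returns the generated tokens and, for each one, which model produced it.
    """
    tokens = list(prompt)
    sources: List[str] = []
    for _ in range(max_new_tokens):
        # The on-device model always proposes a token first.
        token, confidence = small(tokens)
        source = "small"
        if confidence < threshold:
            # Low confidence: replace the proposal with the cloud model's token.
            token = large(tokens)
            source = "large"
        tokens.append(token)
        sources.append(source)
        if token == eos:
            break
    return tokens[len(prompt):], sources


def cloud_fraction(sources: List[str]) -> float:
    """Fraction of generated tokens that were produced by the large model."""
    return sources.count("large") / len(sources) if sources else 0.0
```

In this sketch, `threshold` trades quality against cloud traffic: raising it sends more tokens to the large model, and `cloud_fraction` reports the resulting share of routed tokens for a given run.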

Paper Structure

This paper contains 12 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: System overview: The Hugging Face model is first converted to an ONNX model, and the hidden states of the last layer are added as an output node in the ONNX computation graph. The ONNX model is deployed on a laptop and ONNX-mobile on a mobile phone. The edge device and its router are then connected to the SGLang backend on the server side. The router automatically routes low-confidence tokens to the server, which sends its response back to the edge device.
  • Figure 2: Computation procedure: Unlike conventional inference, the token routing system involves multiple rounds of prefill and decode within a single request. This prevents full utilization of inference acceleration engines such as SGLang and vLLM, which optimize kernels and the KV cache only for a single-stage prefill and decode.
  • Figure 3: User interface of the token-level routing system. Users can set prompts, thresholds, and decoding modes. Tokens from the large model are highlighted in red for interpretability.
  • Figure 4: Left: ONNX computation graph of the original Qwen-0.5B model. Right: Modified graph with last-layer hidden states exposed as an output.
  • Figure 5: An example of the custom API format used to pass internal model state and routing metadata between modules.
  • ...and 3 more figures