Compact LLM Deployment and World Model Assisted Offloading in Mobile Edge Computing

Ruichen Zhang, Xiaofeng Luo, Jiayi He, Dusit Niyato, Jiawen Kang, Zehui Xiong, Yonghui Li

TL;DR

The paper develops a world model-proximal policy optimization (PPO) algorithm that augments on-policy PPO with a learned recurrent world model providing improved value targets and short imagination rollouts. The resulting offloading policy approaches the generation quality of always-offloading while retaining much of the efficiency of local execution.

Abstract

This paper investigates compact large language model (LLM) deployment and world-model-assisted inference offloading in mobile edge computing (MEC) networks. We first propose an edge compact LLM deployment (ECLD) framework that jointly applies structured pruning, low-bit quantization, and knowledge distillation to construct edge-deployable LLM variants, and we evaluate these models using four complementary metrics: accessibility, energy consumption, hallucination rate, and generalization accuracy. Building on the resulting compact models, we formulate an MEC offloading optimization problem that minimizes the long-term average inference latency subject to per-device energy budgets and LLM-specific quality-of-service constraints on effective accuracy and hallucination. To solve this problem under unknown and time-varying network dynamics, we develop a world model-proximal policy optimization (PPO) algorithm, which augments an on-policy PPO algorithm with a learned recurrent world model that provides improved value targets and short imagination rollouts. Extensive experiments on Llama-3.1-8B, Qwen3-8B, and Mistral-12B show that ECLD compresses base models by about 70-80% in storage (e.g., from 15.3 GB to 3.3 GB for Llama-3.1-8B) and reduces per-query energy consumption by up to 50%, while largely preserving accuracy and often lowering hallucination compared with quantization-only or pruning-only baselines. The experiments also show that world model-PPO speeds up convergence by about 50%, improves the final reward by 15.8% over vanilla PPO, and reduces average inference latency by 12-30% across different user populations, while satisfying the accuracy and hallucination constraints and approaching the generation quality of always-offloading with much of the efficiency of local execution.
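
As a quick sanity check on the storage figures quoted in the abstract, the following minimal sketch computes the fractional reduction implied by the Llama-3.1-8B sizes (15.3 GB to 3.3 GB). The helper name `storage_reduction` is hypothetical and introduced only for illustration; it is not part of the ECLD framework.

```python
def storage_reduction(base_gb: float, compact_gb: float) -> float:
    """Return the fractional storage reduction going from base_gb to compact_gb.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if base_gb <= 0:
        raise ValueError("base_gb must be positive")
    return 1.0 - compact_gb / base_gb


# Sizes quoted in the abstract for Llama-3.1-8B.
reduction = storage_reduction(15.3, 3.3)
print(f"storage reduction: {reduction:.1%}")  # prints "storage reduction: 78.4%"
```

The result, roughly 78%, falls inside the reported 70-80% range, which is consistent with the Llama-3.1-8B figure being one example rather than the only data point.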

Paper Structure

This paper contains 44 sections, 48 equations, 10 figures, 1 table, 1 algorithm.

Figures (10)

  • Figure 1: Overview of research contents in this paper, including the workflow of the proposed ECLD framework, the formulation of LLM inference offloading problem, the development of a world model-PPO algorithm for dynamic offloading, and comprehensive validation on a real MEC testbed.
  • Figure 2: Workflow of the proposed ECLD framework for compact LLM deployment including four stages. Stage I is the sequential model pruning process for model size reduction. Stage II is the model distillation process through knowledge distillation for performance recovery. Stage III is the model quantization process for hardware efficiency. Stage IV is the optimized model deployment process tailored to resource-constrained mobile and edge devices.
  • Figure 3: System model of cooperative LLM inference. Each mobile LLM user partitions its task between local execution using a compact quantized LLM and remote execution via uplink offloading to an MEC server hosting a distilled LLM.
  • Figure 4: Architecture of the world model-PPO, where the actor–critic networks of PPO are jointly updated with a lightweight RSSM-based world model that predicts latent dynamics, reconstructs observations, and supports imagination-based policy improvement.
  • Figure 5: Hardware setup of the compact LLM offloading testbed. The MEC server is equipped with an Intel Xeon Platinum 8380 CPU, an NVIDIA H200 GPU, and 128 GB of RAM. The user devices include: (a) an NVIDIA Jetson Nano B01 Developer Kit (4 GB) with a 128-core Maxwell GPU and dual antennas for wireless communications; (b) a Xiaomi 10 Ultra smartphone with 12 GB of RAM and a Qualcomm Snapdragon 865 processor.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Remark 1
  • Example 1