Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings

Liangqi Yuan, Dong-Jun Han, Shiqiang Wang, Christopher G. Brinton

TL;DR

This work addresses the challenge of deploying LLMs in real-world, resource-constrained environments by proposing TMO, a local-cloud inference system that offloads across three dimensions: modality, task, and dialogue. TMO uses a resource-constrained reinforcement learning framework to jointly optimize the choice of local vs. cloud LLM and the modalities to upload, aiming to maximize a long-term reward that combines response quality, latency, and cost. A key innovation is an offline learning approach that uses a nearest-neighbor response-score estimator to handle uncertainty. The accompanying M4A1 dataset provides a benchmark across modalities, tasks, dialogues, and LLM configurations. Results show that TMO, especially with RC-A2C, achieves higher overall reward than the baselines and effectively trades off response quality against latency and cost, demonstrating potential for practical, scalable LLM-assisted systems in multi-modal, multi-task, and multi-dialogue settings.
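
To make the idea of a nearest-neighbor response-score estimator concrete, here is a minimal sketch. It is not the paper's implementation: the feature vectors, the Euclidean distance, the single-neighbor rule, and the function name `nearest_neighbor_score` are all assumptions made for illustration. The idea shown is only that the score of an unseen query can be approximated by the recorded score of the most similar previously observed query.

```python
import math


def nearest_neighbor_score(query_features, records):
    """Estimate a response score from the closest stored record.

    `records` is a list of (feature_vector, observed_score) pairs collected
    offline. Both the feature representation and the use of Euclidean
    distance are illustrative assumptions, not details from the paper.
    """
    if not records:
        raise ValueError("at least one stored record is required")
    # Pick the stored record whose features are closest to the query.
    _, score = min(records, key=lambda rec: math.dist(query_features, rec[0]))
    return score


# Hypothetical offline records: (features, observed response score).
records = [
    ((0.0, 0.0), 0.2),
    ((1.0, 1.0), 0.9),
    ((5.0, 5.0), 0.4),
]
estimate = nearest_neighbor_score((0.9, 1.2), records)
```

In this toy example the query is closest to the second record, so the estimate is that record's score.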

Abstract

Compared to traditional machine learning models, recent large language models (LLMs) can exhibit multi-task-solving capabilities through multiple dialogues and multi-modal data sources. These unique characteristics of LLMs, together with their large model size, make their deployment more challenging. Specifically, (i) deploying LLMs on local devices faces computational, memory, and energy resource issues, while (ii) deploying them in the cloud cannot guarantee real-time service and incurs communication/usage costs. In this paper, we design TMO, a local-cloud LLM inference system with Three-M Offloading: Multi-modal, Multi-task, and Multi-dialogue. TMO incorporates (i) a lightweight local LLM that can process simple tasks at high speed and (ii) a large-scale cloud LLM that can handle multi-modal data sources. We develop a resource-constrained reinforcement learning (RCRL) strategy for TMO that optimizes the inference location (i.e., local vs. cloud) and multi-modal data sources to use for each task/dialogue, aiming to maximize the long-term reward (response quality, latency, and usage cost) while adhering to resource constraints. We also contribute M4A1, a new dataset we curated that contains reward and cost metrics across multiple modality, task, dialogue, and LLM configurations, enabling evaluation of offloading decisions. We demonstrate the effectiveness of TMO compared to several exploration-decision and LLM-as-Agent baselines, showing significant improvements in latency, cost, and response quality.
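
The decision problem described in the abstract can be sketched in a few lines. The sketch below is a hedged illustration, not the RCRL algorithm: it enumerates a joint action space (inference location plus uploaded modalities), scores each action with a weighted reward over response quality, latency, and usage cost, and picks the best action that fits a remaining budget with a greedy one-step rule. The modality names, the weights, the assumption that the local LLM receives no uploaded modalities, and all function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical modality names, chosen only for illustration.
MODALITIES = ("first_person", "overhead")


@dataclass(frozen=True)
class Action:
    location: str           # "local" or "cloud"
    modalities: tuple = ()  # modalities uploaded with the query


def enumerate_actions(modalities=MODALITIES):
    """List the local action plus cloud with every subset of modalities."""
    actions = [Action("local")]  # assumption: local inference is text-only
    for k in range(len(modalities) + 1):
        for subset in combinations(modalities, k):
            actions.append(Action("cloud", subset))
    return actions


def reward(quality, latency, cost, w_quality=1.0, w_latency=0.1, w_cost=0.1):
    """Weighted reward: quality is rewarded, latency and cost are penalized.

    The weights are placeholders, not values from the paper.
    """
    return w_quality * quality - w_latency * latency - w_cost * cost


def best_feasible_action(estimates, remaining_budget):
    """Greedy stand-in for a learned policy: best reward within the budget.

    `estimates` maps each Action to a dict with keys
    "quality", "latency", and "cost".
    """
    feasible = {a: e for a, e in estimates.items() if e["cost"] <= remaining_budget}
    if not feasible:
        return None
    return max(feasible, key=lambda a: reward(**feasible[a]))


# Hypothetical per-action estimates for a single query.
estimates = {
    Action("local"): {"quality": 0.5, "latency": 1.0, "cost": 0.0},
    Action("cloud", ("first_person",)): {"quality": 1.0, "latency": 3.0, "cost": 2.0},
}
choice_tight = best_feasible_action(estimates, remaining_budget=1.0)
choice_loose = best_feasible_action(estimates, remaining_budget=5.0)
```

With a tight budget only the local action is feasible; with a looser budget the cloud action wins because its quality gain outweighs the latency and cost penalties under these placeholder weights. A reinforcement learning policy would instead optimize the long-term reward across tasks and dialogues rather than each step in isolation.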

Paper Structure

This paper contains 31 sections, 15 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Application scenario of the TMO system as an LLM assistant. In this example, the cloud LLM is selected along with the first-person and overhead views, as the query requires visual information.
  • Figure 2: Illustration of RCRL within the TMO System.
  • Figure 3: Illustration of M4A1 Dataset.
  • Figure 4: Effect of Constraints. TMO demonstrates a faster reduction in resource constraint violations compared to baselines.
  • Figure 5: Effect of local device performance. TMO achieves superior performance on all local devices compared to baselines.
  • ...and 1 more figure