Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents

Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu

TL;DR

The paper addresses the practical challenge of deploying on-device AI agents by introducing Division-of-Thoughts (DoT), a three-part framework that decomposes user queries into subtasks, schedules them via a dependency graph, and allocates each subtask to either local SLMs or cloud LLMs using a detachable plug-and-play adapter. A self-reinforced training method based on an $\alpha$-Tree dataset guides subtask allocation without modifying base model parameters, enabling cost-efficient, parallel edge-cloud reasoning. Empirical results across seven benchmarks show substantial reductions in reasoning time and API cost while maintaining competitive accuracy, demonstrating the approach's scalability and practicality for privacy-preserving on-device AI. Overall, DoT provides a principled, generalizable strategy for efficient hybrid inference in resource-constrained environments with strong cost-performance gains.

Abstract

The rapid expansion of web content has made on-device AI assistants indispensable for helping users manage the increasing complexity of online tasks. The emergent reasoning abilities of large language models offer a promising path for next-generation on-device AI agents. However, deploying full-scale Large Language Models (LLMs) on resource-limited local devices is challenging. In this paper, we propose Division-of-Thoughts (DoT), a collaborative reasoning framework leveraging the synergy between locally deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT uses a Task Decomposer to elicit the inherent planning abilities of language models and decompose user queries into smaller sub-tasks, which allows hybrid language models to fully exploit their respective strengths. In addition, DoT employs a Task Scheduler to analyze the pair-wise dependencies of sub-tasks and create a dependency graph, facilitating parallel reasoning over sub-tasks and the identification of key steps. To allocate the appropriate model based on the difficulty of each sub-task, DoT uses a Plug-and-Play Adapter, an additional task head attached to the SLM that does not alter the SLM's parameters. To boost the adapter's task allocation capability, we propose a self-reinforced training method that relies solely on task execution feedback. Extensive experiments on various benchmarks demonstrate that DoT significantly reduces LLM costs while maintaining competitive reasoning accuracy. Specifically, DoT reduces the average reasoning time and API costs by 66.12% and 83.57%, respectively, while achieving reasoning accuracy comparable to the best baseline methods.
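The scheduling and allocation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `parallel_levels` and `allocate`, the numeric difficulty scores, and the 0.5 threshold are all hypothetical stand-ins. In DoT the allocation decision comes from the trained adapter, whereas here it is replaced by a fixed threshold on made-up scores purely for illustration.

```python
def parallel_levels(deps):
    """Group sub-tasks into batches whose members can run in parallel.

    `deps` maps each sub-task id to the set of sub-task ids it depends on.
    Assumption: every dependency also appears as a key of `deps`.
    """
    remaining = {task: set(d) for task, d in deps.items()}
    levels = []
    while remaining:
        # A sub-task is ready once all of its dependencies have finished.
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        levels.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)
    return levels


def allocate(difficulty, threshold=0.5):
    """Route each sub-task to the local SLM or the cloud LLM.

    Hypothetical stand-in for the adapter: a fixed threshold on a
    difficulty score in [0, 1] rather than a learned prediction.
    """
    return {
        task: "cloud_llm" if score >= threshold else "local_slm"
        for task, score in difficulty.items()
    }


# Illustrative query split into four sub-tasks (made-up data).
deps = {"t1": set(), "t2": set(), "t3": {"t1", "t2"}, "t4": {"t3"}}
difficulty = {"t1": 0.2, "t2": 0.7, "t3": 0.4, "t4": 0.9}

allocation = allocate(difficulty)
plan = [[(t, allocation[t]) for t in level] for level in parallel_levels(deps)]
for step, batch in enumerate(plan, start=1):
    print(step, batch)
```

In this toy plan, `t1` and `t2` have no dependencies and form the first parallel batch, while `t3` and `t4` must wait for their predecessors. Only sub-tasks scored at or above the threshold are sent to the cloud model, which is the mechanism by which a scheme of this kind would save API cost.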

Paper Structure

This paper contains 30 sections, 1 equation, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of Our Proposed DoT Framework.
  • Figure 2: Advantage of "Division-and-Allocate" Strategy
  • Figure 3: Illustrating Dependency Graph of Task Scheduling
  • Figure 4: Tree Search-Based Dataset Construction Process
  • Figure 5: Proportion of SLMs in time cost and # sub-tasks.
  • ...and 1 more figure