Octo-planner: On-device Language Model for Planner-Action Agents

Wei Chen, Zhiyuan Li, Zhen Guo, Yikang Shen

TL;DR

This work targets practical, private, on-device AI by introducing Octo-planner, a Planner-Action framework that offloads planning to a dedicated edge-optimized planner ($3.8\times10^9$ parameters) and delegates execution to an on-device Octopus-based action agent. Planning data are generated and validated with GPT-4 to fine-tune the planner, achieving high in-domain success ($\approx97\%$) and enabling efficient, multi-domain operation via Multi-LoRA merging. The study provides a comprehensive evaluation across full fine-tuning and LoRA variants, base-model sizes, and dataset scales, showing that larger models and richer datasets improve accuracy while multi-domain merging introduces trade-offs. Open-sourcing the weights encourages practical edge-AI deployment, highlighting the potential for privacy-preserving, low-latency autonomous agents on mobile devices and beyond.
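
As a rough illustration of the planner/action split described above, the sketch below pairs a planner model that decomposes a query into sub-steps with a separate action model that handles each step. It is a minimal sketch, not the paper's implementation: the chat template, the step-parsing rule, and the use of NexaAIDev/Octopus-v2 as a stand-in action model are assumptions; only the planner repository (NexaAIDev/octopus-planning) comes from the released weights.

```python
# Minimal sketch of the Planner-Action split (illustrative, not the paper's code).
from transformers import AutoModelForCausalLM, AutoTokenizer

def load(model_id: str):
    tok = AutoTokenizer.from_pretrained(model_id)
    return tok, AutoModelForCausalLM.from_pretrained(model_id)

def generate(tok, model, prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and return only the newly generated text.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def plan_and_act(query: str) -> list[str]:
    # Planner: decompose the user query into a sequence of sub-steps.
    p_tok, planner = load("NexaAIDev/octopus-planning")
    plan = generate(p_tok, planner, f"<|user|>\n{query}<|end|>\n<|assistant|>\n")
    steps = [s.strip() for s in plan.splitlines() if s.strip()]

    # Action agent: execute each sub-step with a separate Octopus-style model.
    # "NexaAIDev/Octopus-v2" is used here only as a stand-in action model.
    a_tok, action = load("NexaAIDev/Octopus-v2")
    return [generate(a_tok, action, step) for step in steps]
```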

Abstract

AI agents have become increasingly significant in various domains, enabling autonomous decision-making and problem-solving. To function effectively, these agents require a planning process that determines the best course of action and then executes the planned actions. In this paper, we present an efficient on-device Planner-Action framework that separates planning and action execution into two distinct components: a planner agent based on Phi-3 Mini, a 3.8 billion parameter LLM optimized for edge devices, and an action agent using the Octopus model for function execution. The planner agent first responds to user queries by decomposing tasks into a sequence of sub-steps, which are then executed by the action agent. To optimize performance on resource-constrained devices, we employ model fine-tuning instead of in-context learning, reducing computational costs and energy consumption while improving response times. Our approach involves using GPT-4 to generate diverse planning queries and responses based on available functions, with subsequent validations to ensure data quality. We fine-tune the Phi-3 Mini model on this curated dataset, achieving a 97\% success rate in our in-domain test environment. To address multi-domain planning challenges, we developed a multi-LoRA training method that merges weights from LoRAs trained on distinct function subsets. This approach enables flexible handling of complex, multi-domain queries while maintaining computational efficiency on resource-constrained devices. To support further research, we have open-sourced our model weights at \url{https://huggingface.co/NexaAIDev/octopus-planning}. For the demo, please refer to \url{https://www.nexa4ai.com/octo-planner}.
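
The abstract's multi-LoRA method merges weights from LoRAs trained on distinct function subsets. The sketch below shows one way such a merge could look using the Hugging Face peft API; the adapter paths, equal weights, linear combination rule, and the Phi-3 Mini 4k-instruct base checkpoint are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of merging per-domain LoRA adapters into one adapter (illustrative).
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Load one LoRA per function domain (paths are placeholders).
model = PeftModel.from_pretrained(base, "lora-android-functions", adapter_name="android")
model.load_adapter("lora-messaging-functions", adapter_name="messaging")

# Combine the domain adapters into a single merged adapter and activate it.
model.add_weighted_adapter(
    adapters=["android", "messaging"],
    weights=[0.5, 0.5],
    adapter_name="multi_domain",
    combination_type="linear",
)
model.set_adapter("multi_domain")
```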

Paper Structure

This paper contains 15 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Planner-Action Agent on a smartphone using Octopus models
  • Figure 2: Comparison of Single LLM Agent and Planner-Action Agent frameworks. (left) Single LLM Agent: A unified model performs both task planning and action execution. (right) Planner-Action Agent: A specialized planner model decomposes the task into subtasks, while a separate action model executes each subtask sequentially.
  • Figure 3: Dataset Collection Process for Planner Model Training. First, we identify the number of steps required, setting $N$ from 1 to 5 in our current case. Next, we generate corresponding queries and steps for each query.
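
Figure 3's caption outlines the dataset collection loop: pick a step count $N$ from 1 to 5, then generate queries and their sub-steps for each $N$. The sketch below shows roughly what that loop could look like with the OpenAI API; the prompt wording, the placeholder function names, and the minimal validation comment are assumptions, whereas the paper uses GPT-4 with its own prompts and a separate validation pass to ensure data quality.

```python
# Rough sketch of the query/plan generation loop implied by Figure 3 (illustrative).
from openai import OpenAI

client = OpenAI()
FUNCTIONS = ["take_photo", "send_email", "set_alarm"]  # placeholder function names

def generate_examples(n_steps: int, n_examples: int = 10) -> list[dict]:
    prompt = (
        f"Available functions: {', '.join(FUNCTIONS)}.\n"
        f"Write {n_examples} user queries that each require exactly {n_steps} "
        f"of these functions, followed by the numbered sub-steps to solve them."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    # A real pipeline would parse and validate each (query, steps) pair here,
    # e.g. checking that every referenced function exists and the step count matches.
    return [{"n_steps": n_steps, "raw": text}]

# N from 1 to 5, as in the Figure 3 caption.
dataset = [ex for n in range(1, 6) for ex in generate_examples(n)]
```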