Efficient LLM Collaboration via Planning

Byeongchan Lee, Jonghoon Lee, Dongyoung Kim, Jaehyung Kim, Kyungjoon Park, Dongjun Lee, Jinwoo Shin

TL;DR

COPE reduces the cost of relying on large LLMs by enabling test-time collaboration between small open-source models and large proprietary models through planning. It structures collaboration as a three-stage cascade in which planners and executors exchange lightweight plans, escalating to the next stage only when a confidence threshold is not met. Across math reasoning, code generation, open-ended tasks, and agent tasks, COPE matches or surpasses large-model baselines while drastically reducing cost, demonstrating the value of planning as a cost-efficient inference prior. The approach is training-free and easy to deploy in edge-cloud settings, highlighting planning as a scalable strategy for accessible, high-quality LLM inference.

Abstract

Recently, large language models (LLMs) have demonstrated strong performance on tasks ranging from simple to complex. However, while large proprietary models (e.g., models with over 100B parameters) achieve remarkable results across diverse tasks, they are often accessible through APIs whose cost makes frequent use impractical for many applications. In contrast, small open-source models (e.g., models with fewer than 3B parameters) are freely available and easy to deploy locally, but their performance on complex tasks remains limited. This trade-off raises a natural question: how can small and large models collaborate efficiently to combine their complementary strengths? To address this question, we propose COPE, a test-time collaboration framework. A planner model first generates a plan, a lightweight intermediate that guides a downstream executor model. Small and large models take turns acting as planner and executor, exchanging plans in a multi-stage cascade to collaboratively solve tasks. Through comprehensive experiments on benchmarks spanning mathematical reasoning, code generation, open-ended tasks, and agent tasks, we demonstrate that COPE achieves performance comparable to large proprietary models while drastically reducing inference API cost. These results highlight planning as an effective prior for cost-efficient inference.

Paper Structure

This paper contains 49 sections, 5 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: Overall framework of COPE. The system proceeds in up to three stages of inference, where small and large models alternate roles as planner and executor. In each stage, given a task, a plan is generated by the planner, and the executor produces candidate outputs. If a task-specific confidence falls below the stage threshold, the task escalates to the next stage. Plans generated in earlier stages are retained and reused in later stages.
  • Figure 2: Comparison between vanilla inference and planning-guided inference with COPE. The vanilla model fails to account for the divisibility and inequality constraints, leading to incorrect reasoning (red). In contrast, COPE’s planner highlights these constraints explicitly (green), allowing the executor to follow a structured solution path.
  • Figure 3: Common inference module in COPE.
  • Figure 4: Comparison between vanilla and COPE inference on a MATH-500 problem. The vanilla solution (left), generated by EXAONE-3.5-2.4B-Instruct, results in incorrect reasoning (red). In contrast, COPE combines a goal from the same model with a guideline from GPT-4o-mini, highlighting key constraints (green) and guiding the executor to a correct solution.
  • Figure 5: Comparison between vanilla and COPE inference on a MATH-500 problem. The vanilla solution (left), generated by EXAONE-3.5-2.4B-Instruct, results in incorrect reasoning (red). In contrast, COPE uses a goal from the same model, highlighting a key condition (green) and guiding the executor to a correct solution.
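
The escalation logic described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Stage`, `vote_confidence`, and `run_cascade` are hypothetical, the planner and executor are stand-in callables rather than real model calls, and the confidence measure (agreement among sampled candidates) is an assumption, since the text above only says the confidence is task-specific.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical signatures: a planner maps (task, earlier plans) to a new plan;
# an executor maps (task, all plans so far) to one candidate answer.
Planner = Callable[[str, List[str]], str]
Executor = Callable[[str, List[str]], str]


@dataclass
class Stage:
    planner: Planner
    executor: Executor
    threshold: float         # minimum confidence needed to stop at this stage
    num_candidates: int = 5  # how many candidate outputs the executor samples


def vote_confidence(candidates: List[str]) -> Tuple[str, float]:
    """Return the most common candidate and its share of all candidates.

    Assumption: agreement among candidates stands in for the
    task-specific confidence, which is not defined in the text above.
    """
    answer, count = Counter(candidates).most_common(1)[0]
    return answer, count / len(candidates)


def run_cascade(task: str, stages: List[Stage]) -> Tuple[str, int]:
    """Run stages in order; return (answer, 1-based index of the final stage)."""
    if not stages:
        raise ValueError("at least one stage is required")
    plans: List[str] = []  # plans from earlier stages are retained and reused
    answer = ""
    for index, stage in enumerate(stages, start=1):
        # The planner sees a copy of the earlier plans and adds a new one.
        plans.append(stage.planner(task, list(plans)))
        candidates = [
            stage.executor(task, list(plans))
            for _ in range(stage.num_candidates)
        ]
        answer, confidence = vote_confidence(candidates)
        if confidence >= stage.threshold:
            return answer, index  # confident enough: no escalation
    # No stage met its threshold: fall back to the last stage's answer.
    return answer, len(stages)
```

In this sketch, which model fills the planner and executor role at each stage is left to the caller. Cost is saved whenever an early stage built from cheaper calls clears its threshold, because later stages are then never invoked.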