Efficient LLM Collaboration via Planning
Byeongchan Lee, Jonghoon Lee, Dongyoung Kim, Jaehyung Kim, Kyungjoon Park, Dongjun Lee, Jinwoo Shin
TL;DR
COPE addresses the high cost of large LLMs by enabling efficient test-time collaboration between small open-source models and large proprietary models through planning. It structures collaboration as a three-stage cascade in which planners and executors exchange lightweight plans, escalating to the next stage only when a confidence threshold is not met. Across math reasoning, code generation, open-ended tasks, and agent tasks, COPE matches or surpasses large-model baselines while drastically reducing cost, showing that planning works as a cost-efficient inference prior. The approach is training-free and easy to deploy in edge-cloud settings, which makes planning a scalable strategy for accessible, high-quality LLM inference.
Abstract
Large language models (LLMs) have recently demonstrated strong performance on tasks ranging from simple to complex. However, while large proprietary models (e.g., models with over 100B parameters) achieve remarkable results across diverse tasks, they are often accessible only through paid APIs, making frequent use too expensive for many applications. In contrast, small open-source models (e.g., models with fewer than 3B parameters) are freely available and easy to deploy locally, but their performance on complex tasks remains limited. This trade-off raises a natural question: how can small and large models collaborate efficiently to combine their complementary strengths? To answer it, we propose COPE, a test-time collaboration framework. A planner model first generates a plan, a lightweight intermediate representation that guides a downstream executor model. Small and large models take turns acting as planner and executor, exchanging plans in a multi-stage cascade to solve tasks collaboratively. Through comprehensive experiments on benchmarks spanning mathematical reasoning, code generation, open-ended tasks, and agent tasks, we demonstrate that COPE achieves performance comparable to large proprietary models while drastically reducing the inference API cost. These results highlight planning as an effective prior for cost-efficient inference.
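The planner-executor cascade described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything beyond "a planner produces a plan, an executor follows it, and the cascade escalates when confidence is too low" is an assumption made for illustration. The `Planner` and `Executor` call signatures are hypothetical, as is the pairing of models in each stage, which the caller supplies. Measuring confidence as the agreement rate among several sampled answers is likewise an assumed choice.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a planner maps a task to a plan; an executor maps (task, plan) to an answer.
Planner = Callable[[str], str]
Executor = Callable[[str, str], str]
Stage = Tuple[str, Planner, Executor]


def vote_confidence(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of samples agreeing with it.

    Agreement among samples is an assumed confidence measure.
    """
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def cascade(
    task: str,
    stages: List[Stage],
    thresholds: List[float],
    n_samples: int = 4,
) -> Tuple[str, str]:
    """Run stages in order; stop at the first one whose confidence meets its threshold.

    `thresholds` has one entry per non-final stage. The final stage always
    returns its answer, since there is nothing left to escalate to.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    if len(thresholds) != len(stages) - 1:
        raise ValueError("need one threshold per non-final stage")

    for i, (name, planner, executor) in enumerate(stages):
        plan = planner(task)  # the lightweight intermediate passed to the executor
        answers = [executor(task, plan) for _ in range(n_samples)]
        answer, confidence = vote_confidence(answers)
        is_last = i == len(stages) - 1
        if is_last or confidence >= thresholds[i]:
            return answer, name
    raise AssertionError("unreachable: the final stage always returns")
```

In this sketch the cheaper stages come first in `stages`, so the large model is called only when earlier stages fail to reach their thresholds. That ordering is where the cost saving would come from.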
