An Empirical Study on Strong-Weak Model Collaboration for Repo-level Code Generation

Shubham Gandhi, Atharva Naik, Yiqing Xie, Carolyn Rose

TL;DR

The paper addresses the cost of deploying repository-level code generation by exploring a taxonomy of strong–weak collaboration strategies across static context augmentation, pipeline division, and dynamic routing. It empirically evaluates 12 methods on SWE-Bench Lite using an Agentless-Lite framework to characterize cost–performance trade-offs and provide practical guidelines for budget-constrained deployments. Key findings show that pipeline- and context-based approaches are typically the most cost-efficient, with no universal winner across all budgets; the best configurations can approach strong-model performance at a fraction of the cost, as demonstrated by cost-performance curves. These insights offer actionable strategies for designing scalable, cost-aware LLM systems for complex code-generation tasks in real-world settings.

Abstract

We study cost-efficient collaboration between strong and weak language models for repository-level code generation, where the weak model handles simpler tasks at lower cost, and the most challenging tasks are delegated to the strong model. While many works propose architectures for this task, few analyze performance relative to cost. We evaluate a broad spectrum of collaboration strategies (context-based, pipeline-based, and dynamic) on GitHub issue resolution. Our most effective collaborative strategy achieves performance equivalent to the strong model while reducing cost by 40%. Based on our findings, we offer actionable guidelines for choosing collaboration strategies under varying budget and performance constraints. Our results show that strong-weak collaboration substantially boosts the weak model's performance at a fraction of the cost, with pipeline- and context-based methods being the most efficient. We release the code for our work at https://github.com/shubhamrgandhi/codegen-strong-weak-collab.

Paper Structure

This paper contains 24 sections, 5 figures, 14 tables.

Figures (5)

  • Figure 1: Taxonomy of the 14 techniques studied. * denotes methods newly proposed or adapted in this study. We categorize them into cost-equated weak-only, context-based, pipeline-based, and dynamic collaboration methods.
  • Figure 2: Performance vs. Cost curves for different Strong-Weak Model Pairs. O3 - O3-mini; O4 - O4-mini; 4o - GPT-4o-mini; Qx - Qwen2.5-Coder-xB. Detailed pairwise results are in the appendix figure (fig:perf_cost_comparison) and appendix tables (tab:results-o4-mini-gpt-4o-mini through tab:results-gpt-4o-mini-qwen25coder7b).
  • Figure 3: The Agentless Lite Framework: RAG + Code Generation
  • Figure 4: Performance vs. cost comparison across different Strong-Weak LM pairs, denoted as (Strong LM + Weak LM). The line is a monotonically increasing curve showing the best-performing method available at each cost budget.
  • Figure 5: Heatmaps of issue category wise performance for O4-mini + GPT-4o-mini model pair.
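
The monotonically increasing curve described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes each method is summarized by a single (cost, performance) point; the function names, method names, and numbers below are hypothetical placeholders for illustration, not results or code from the paper.

```python
from typing import List, Optional, Tuple

# (method name, cost, performance); all values used below are placeholders.
Method = Tuple[str, float, float]


def monotone_frontier(methods: List[Method]) -> List[Method]:
    """Keep only methods that beat every cheaper (or equally priced) method."""
    frontier: List[Method] = []
    best_perf = float("-inf")
    # Sort by cost ascending; on ties, consider the higher-performing method first.
    for name, cost, perf in sorted(methods, key=lambda m: (m[1], -m[2])):
        if perf > best_perf:
            frontier.append((name, cost, perf))
            best_perf = perf
    return frontier


def best_under_budget(methods: List[Method], budget: float) -> Optional[Method]:
    """Return the best-performing method whose cost fits the budget, if any."""
    affordable = [m for m in methods if m[1] <= budget]
    return max(affordable, key=lambda m: m[2]) if affordable else None


# Hypothetical data points, NOT taken from the paper.
methods = [
    ("weak_only", 1.0, 10.0),
    ("context_aug", 2.0, 25.0),
    ("routing", 3.0, 20.0),
    ("pipeline", 4.0, 30.0),
    ("strong_only", 10.0, 32.0),
]

frontier = monotone_frontier(methods)
choice = best_under_budget(methods, budget=3.5)
```

With these placeholder values, `routing` is left off the frontier because the cheaper `context_aug` already outperforms it, which is the sense in which no single method need be best at every budget.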