Compositional Hardness of Code in Large Language Models -- A Probabilistic Perspective
Yotam Wolf, Binyamin Rothberg, Dorin Shteyman, Amnon Shashua
TL;DR
This work identifies an in-context hardness of composition in large language models for coding tasks, where solving multiple subproblems within a single context dramatically increases the generation complexity. By modeling LLMs as autoregressive generators subject to noise from mixing subproblems (screening), the authors prove an exponential gap between solving a composite problem in one context versus distributing subproblems across multiple agents. They formalize generation complexity $N(P,x)$, derive bounds showing $N(P,(x_1\oplus x_2))$ can exceed $N(P,x_1)N(P,x_2)$ by a factor exponential in the total sub-solution length, and validate these results empirically on several models and composite datasets. The findings advocate multi-agent, task-distribution strategies for complex coding tasks and offer a lens on effective context length through the screening phenomenon.
Abstract
A common practice when using large language models (LLMs) for complex analytical tasks such as code generation is to sample a solution for the entire task within the model's context window. Previous work has shown that decomposing the task into subtasks within the model's context (chain of thought) is beneficial for solving such tasks. In this work, we point out a limitation of LLMs' ability to perform several subtasks within the same context window: an in-context hardness of composition, which points to an advantage for distributing a decomposed problem across a multi-agent system of LLMs. The hardness of composition is quantified by a generation complexity metric, i.e., the number of LLM generations required to sample at least one correct solution. We find a gap between the generation complexity of solving a compositional problem within a single context and that of distributing it among multiple agents, and this gap increases exponentially with the solution's length. We prove our results theoretically and demonstrate them empirically.
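To make the generation complexity metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes that generations are independent samples with a fixed success probability `p_correct`, so the expected number of samples until the first correct one is `1 / p_correct` (the mean of a geometric distribution). The function names, and the constant `per_token_penalty` in the toy gap model, are hypothetical and chosen only for illustration; the paper's actual bound and its constants are not reproduced here.

```python
import math


def generation_complexity(p_correct: float) -> float:
    """Expected number of i.i.d. generations until the first correct one.

    Assumption: each generation is correct independently with probability
    p_correct, so the count is geometric with mean 1 / p_correct.
    """
    if not 0.0 < p_correct <= 1.0:
        raise ValueError("p_correct must be in (0, 1]")
    return 1.0 / p_correct


def samples_for_confidence(p_correct: float, confidence: float = 0.95) -> int:
    """Smallest n such that at least one of n samples is correct
    with probability >= confidence, i.e. 1 - (1 - p)^n >= confidence."""
    if not 0.0 < p_correct <= 1.0:
        raise ValueError("p_correct must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if p_correct == 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_correct))


def toy_composite_complexity(n1: float, n2: float, total_length: int,
                             per_token_penalty: float) -> float:
    """Toy model of an exponential composition gap (hypothetical form).

    Scales the product n1 * n2 by exp(per_token_penalty * total_length);
    per_token_penalty is an illustrative constant, not a value from the paper.
    """
    return n1 * n2 * math.exp(per_token_penalty * total_length)


if __name__ == "__main__":
    n1 = generation_complexity(0.25)   # subproblem 1 solved alone
    n2 = generation_complexity(0.5)    # subproblem 2 solved alone
    print("product of separate complexities:", n1 * n2)
    for length in (10, 50, 100):
        composite = toy_composite_complexity(n1, n2, length, 0.05)
        print(f"total sub-solution length {length}: toy composite complexity {composite:.1f}")
```

Under these assumptions, the ratio between the toy composite complexity and the product of the separate complexities grows exponentially with the total sub-solution length, which is the qualitative shape of the gap described above.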
