Compositional Hardness of Code in Large Language Models -- A Probabilistic Perspective

Yotam Wolf; Binyamin Rothberg; Dorin Shteyman; Amnon Shashua

TL;DR

This work identifies an in-context hardness of composition in large language models for coding tasks, where solving multiple subproblems within a single context dramatically increases the generation complexity. By modeling LLMs as autoregressive generators subject to noise from mixing subproblems (screening), the authors prove an exponential gap between solving a composite problem in one context versus distributing subproblems across multiple agents. They formalize generation complexity $N(P,x)$, derive bounds showing $N(P,(x_1\oplus x_2))$ can exceed $N(P,x_1)N(P,x_2)$ by a factor exponential in the total sub-solution length, and validate these results empirically on several models and composite datasets. The findings advocate multi-agent, task-distribution strategies for complex coding tasks and offer a lens on effective context length through the screening phenomenon.

Abstract

A common practice in large language model (LLM) usage for complex analytical tasks, such as code generation, is to sample a solution for the entire task within the model's context window. Previous works have shown that subtask decomposition within the model's context (chain of thought) is beneficial for solving such tasks. In this work, we point out a limitation of LLMs' ability to perform several sub-tasks within the same context window: an in-context hardness of composition, which points to an advantage for distributing a decomposed problem in a multi-agent system of LLMs. The hardness of composition is quantified by a generation complexity metric, i.e., the number of LLM generations required to sample at least one correct solution. We find a gap between the generation complexity of solving a compositional problem within the same context and that of distributing it among multiple agents, and this gap increases exponentially with the solution's length. We prove our results theoretically and demonstrate them empirically.
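The generation complexity metric can be illustrated with a minimal sketch. It assumes each generation is an independent draw that is correct with probability p, so the expected number of draws until the first correct one is 1/p (the mean of a geometric distribution). The function names and the example probabilities are hypothetical and are not taken from the paper.

```python
def generation_complexity(p_correct: float) -> float:
    """Expected number of i.i.d. samples until the first correct one (1/p)."""
    if not 0.0 < p_correct <= 1.0:
        raise ValueError("p_correct must be in (0, 1]")
    return 1.0 / p_correct


def composition_ratio(p_composed: float, p_sub1: float, p_sub2: float) -> float:
    """Ratio N(P, x) / (N(P, x1) * N(P, x2)).

    The denominator is the product of the standalone generation complexities,
    i.e. the multi-agent setting. Values above 1 mean that solving both
    sub-problems in one context needs more samples than solving them apart.
    """
    n_composed = generation_complexity(p_composed)
    n_multi_agent = generation_complexity(p_sub1) * generation_complexity(p_sub2)
    return n_composed / n_multi_agent


# Hypothetical success probabilities, chosen only for illustration.
ratio = composition_ratio(p_composed=0.05, p_sub1=0.5, p_sub2=0.4)
print(ratio)
```

If the two sub-problems were solved independently within one context, the composed success probability would equal the product of the standalone probabilities and the ratio would be 1. The gap described in the abstract corresponds to a ratio that grows with the solution's length.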

Paper Structure (37 sections, 7 theorems, 79 equations, 9 figures)

Introduction
Related Works
LLMs as solution candidate generators in programming:
Theoretical results on composition:
Effectiveness of LLMs in utilizing long context:
Framework
Generation Complexity
Emergence of compositional coding problems from general coding problems in LLMs with CoT:
Decoupling hardness of decomposition from hardness of composition in general problems:
Screening in Autoregressive Models
Effect of noise on decoding
Results
Generation Complexity For General Problems
Experiments
Generation Complexity Results
...and 22 more sections

Key Result

Lemma 1

Let $\epsilon,\delta\in(0,1)$ and $M>0$. Let $x$ be a compositional problem and $y_1 \oplus y_2$ a solution, with $x_1$, $x_2$ being the corresponding sub-problems. Suppose that the noise injected into the logits, as defined in equation eq:weighted_noise, satisfies assumption assumption:1, and that t[…] where $P(y_1\oplus y_2|x)$ is the probability of producing the answer $y_1\oplus y_2$ given contex[…]

Figures (9)

Figure 1: Cumulative distribution function for the ratio of the generation complexity using composition, $N(P,x)$, to the product of generation complexities for the standalone problems, $N(P,x_1)\cdot N(P,x_2)$ (corresponding to the multi-agent generation complexity). The x axis denotes values of the ratio of generation complexities; the y axis is the percentage of problems in which the ratio is no larger than this value (e.g. for $a=5$, the y axis value is the percentage of problems where composition requires up to $\times 5$ more samples than the multi-agent case). (a) HumanEval composition: in most cases composition requires twice as many samples, and sometimes up to 10 times as many. (b) CodeContests composition: the majority of problems have a factor of at least 5, and some up to 20.
Figure 2: Ratio of correct solution probability with vs. without composition. An exponential trend w.r.t. function length is observed.
Figure 3: Change in logits of correct tokens minus incorrect tokens due to composition.
Figure 4: Cumulative distribution function for the ratio of the generation complexity using composition, $N(P,x)$, to the product of generation complexities for the standalone problems, $N(P,x_1)\cdot N(P,x_2)$ (corresponding to the multi-agent generation complexity). The x axis denotes values of the ratio of the number of generations required to solve the problem in the two cases (composition vs. multi-agent); the y axis is the percentage of problems in which the ratio is no larger than this value (e.g. for $a=5$, the y axis value is the percentage of problems where composition requires up to $\times 5$ more samples than the multi-agent case). In most cases composition requires twice as many samples, and for some problems 10 times as many.
Figure 5: Cumulative distribution function for the ratio of the generation complexity using composition, $N(P,x)$, to the product of generation complexities for the standalone problems, $N(P,x_1)\cdot N(P,x_2)$ (corresponding to the multi-agent generation complexity). The x axis denotes values of the ratio of the number of generations required to solve the problem in the two cases (composition vs. multi-agent); the y axis is the percentage of problems in which the ratio is no larger than this value (e.g. for $a=5$, the y axis value is the percentage of problems where composition requires up to $\times 5$ more samples than the multi-agent case). In most cases composition requires twice as many samples, and for some problems 10 times as many.
...and 4 more figures
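
The cumulative distribution plots described in the captions above can be reproduced in spirit with a short sketch: given one ratio $N(P,x) / (N(P,x_1)\cdot N(P,x_2))$ per problem, compute for each threshold $a$ the percentage of problems whose ratio is no larger than $a$. The function name and the ratio values below are hypothetical placeholders, not the paper's data.

```python
from bisect import bisect_right
from typing import List, Sequence


def empirical_cdf(ratios: Sequence[float], thresholds: Sequence[float]) -> List[float]:
    """For each threshold a, return the percentage of ratios that are <= a."""
    if not ratios:
        raise ValueError("ratios must be non-empty")
    ordered = sorted(ratios)
    total = len(ordered)
    # bisect_right counts how many sorted values are <= the threshold.
    return [100.0 * bisect_right(ordered, a) / total for a in thresholds]


# Hypothetical per-problem ratios, for illustration only.
ratios = [0.8, 1.5, 2.0, 2.0, 4.0, 6.0, 10.0, 12.0]
cdf = empirical_cdf(ratios, thresholds=[1, 2, 5, 10])
print(cdf)
```

Reading the result at a threshold such as $a=5$ gives the share of problems where composition needs at most 5 times as many samples as the multi-agent case, which is how the y axis in the captions is defined.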

Theorems & Definitions (9)

Definition 1
Definition 2
Lemma 1
Theorem 1
Theorem 2
Lemma 2
Lemma 3
Theorem 3
Lemma 4
