
Power and Limitations of Aggregation in Compound AI Systems

Nivasini Ananthakrishnan, Meena Jagadeesan

TL;DR

This work investigates the power and limitations of aggregation within a stylized principal-agent framework. It identifies three mechanisms (feasibility expansion, support expansion, and binding set contraction) through which aggregation expands the set of elicitable outputs, proves that any aggregation operation must implement one of these mechanisms in order to be elicitability-expanding, and shows that strengthened versions of these mechanisms provide necessary and sufficient conditions that fully characterize elicitability-expansion.

Abstract

When designing compound AI systems, a common approach is to query multiple copies of the same model and aggregate the responses to produce a synthesized output. Given the homogeneity of these models, this raises the question of whether aggregation unlocks access to a greater set of outputs than querying a single model. In this work, we investigate the power and limitations of aggregation within a stylized principal-agent framework. This framework models how the system designer can partially steer each agent's output through its reward function specification, but still faces limitations due to prompt engineering ability and model capabilities. Our analysis uncovers three natural mechanisms -- feasibility expansion, support expansion, and binding set contraction -- through which aggregation expands the set of outputs that are elicitable by the system designer. We prove that any aggregation operation must implement one of these mechanisms in order to be elicitability-expanding, and that strengthened versions of these mechanisms provide necessary and sufficient conditions that fully characterize elicitability-expansion. Finally, we provide an empirical illustration of our findings for LLMs deployed in a toy reference-generation task. Altogether, our results take a step towards characterizing when compound AI systems can overcome limitations in model capabilities and in prompt engineering.

Paper Structure (90 sections, 20 theorems, 32 equations, 3 figures, 5 tables)

Key Result

Theorem 3.7

Fix conic constraints $\boldsymbol{C}$, and any aggregation operation ${\boldsymbol{x}}^{(1)}, \ldots, {\boldsymbol{x}}^{(K)} \rightarrow {\boldsymbol{x}}^{(A)}$ where each ${\boldsymbol{x}}^{(k)}$ is feasible (i.e., $\boldsymbol{C} {\boldsymbol{x}}^{(k)} \le 0$, for every $k \in [K]$). If ${\boldsy

Figures (3)

  • Figure 1: Three mechanisms by which the aggregation operation ${\boldsymbol{x}}^{(1)}, {\boldsymbol{x}}^{(2)} \rightarrow {\boldsymbol{x}}^{(A)}$ expands the set of outputs that the system designer can elicit. Feasibility expansion captures when two feasible vectors are aggregated into an infeasible vector (left; Definition 3.1). Support expansion captures when two vectors are aggregated into a vector with richer support (middle; Definition 3.3). Binding set contraction captures when two vectors on the boundary of the feasible set are aggregated into a vector in the interior (right; Definition 3.5). Any aggregation operation must implement one of these mechanisms to offer power to the system designer (Theorem \ref{thm:weaker_necessary}), and strengthened versions of these mechanisms characterize when aggregation adds power (Theorem \ref{thm:necessary}, Theorem \ref{thm:sufficient}). See Figure 3 for an empirical illustration of these mechanisms for LLMs in a reference-generation task.
  • Figure 2: Visualization of output vectors for a reference-generation task (Section \ref{subsec:casestudy}). Output vectors are computed using the $M=768$-dimensional embeddings from all-mpnet-base-v2, shifted to lie in the nonnegative orthant. Embeddings are shown for GPT-4o-mini outputs from five different prompts, as well as for two aggregated outputs based on additional-style and intersection-style aggregation rules. The panels show the $\ell_2$-distances (left) and projections onto the top 3 highest-variance dimensions (middle) and the top 2 highest-variance dimensions (right). The plots show that the five prompts produce semantically different outputs, and that each aggregation operation yields a combination of the five outputs that does not resemble any single output in isolation.
  • Figure 3: Empirical illustration of the three mechanisms in a toy reference-generation task with GPT-4o-mini (Section \ref{sec:empirical}). Output dimensions capture paper topics, prompt dimensions capture (potentially higher-level) topics, and aggregation takes a union or an intersection of the lists. The plots show output vectors generated by the individual prompts (${\boldsymbol{x}}^{(1)}$ and ${\boldsymbol{x}}^{(2)}$) and by the aggregation operation (${\boldsymbol{x}}^{(1)}, {\boldsymbol{x}}^{(2)} \rightarrow {\boldsymbol{x}}^{(A)}$), along with the output vector closest to ${\boldsymbol{x}}^{(A)}$ that is elicitable by a single model with prompt topics (${\boldsymbol{x}}^*_P({\boldsymbol{x}}^{(A)})$). The shaded regions show confidence sets. Each plot shows an aggregation operation that implements one of the mechanisms: support expansion (left), binding-set contraction (middle), and feasibility expansion (right). The plots show that these aggregation operations are all elicitability-expanding. The numerical values are shown in Table \ref{tab:empirical_results}. This plot is an empirical analogue of Figure 1.
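
The three mechanisms described in the Figure 1 caption can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the paper's formal definitions (Definitions 3.1, 3.3, and 3.5 are not reproduced on this page). It assumes feasibility means $\boldsymbol{C}{\boldsymbol{x}} \le 0$ as in Theorem 3.7, reads "richer support" as a strict superset of every input's support, and reads "binding set contraction" as a feasible aggregate whose set of tight constraints is a strict subset of every input's. All function names, the tolerance `TOL`, and the example matrices are hypothetical.

```python
from typing import Sequence, Set

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]
TOL = 1e-9  # numerical tolerance (an assumption of this sketch)


def constraint_values(C: Matrix, x: Vector) -> list:
    """Return the vector C x, one entry per conic constraint."""
    return [sum(c * v for c, v in zip(row, x)) for row in C]


def is_feasible(C: Matrix, x: Vector) -> bool:
    """x is feasible when every constraint satisfies (C x)_i <= 0."""
    return all(s <= TOL for s in constraint_values(C, x))


def binding_set(C: Matrix, x: Vector) -> Set[int]:
    """Indices of constraints that hold with equality at x."""
    return {i for i, s in enumerate(constraint_values(C, x)) if abs(s) <= TOL}


def support(x: Vector) -> Set[int]:
    """Indices of the nonzero coordinates of x."""
    return {j for j, v in enumerate(x) if abs(v) > TOL}


def feasibility_expansion(C: Matrix, xs: Sequence[Vector], x_agg: Vector) -> bool:
    """Feasible inputs are aggregated into an infeasible vector."""
    return all(is_feasible(C, x) for x in xs) and not is_feasible(C, x_agg)


def support_expansion(xs: Sequence[Vector], x_agg: Vector) -> bool:
    """Assumed reading: supp(x_agg) strictly contains every input's support."""
    s_agg = support(x_agg)
    return all(support(x) < s_agg for x in xs)


def binding_set_contraction(C: Matrix, xs: Sequence[Vector], x_agg: Vector) -> bool:
    """Assumed reading: x_agg is feasible and its binding set is a strict
    subset of every input's binding set (boundary inputs, freer aggregate)."""
    if not is_feasible(C, x_agg):
        return False
    b_agg = binding_set(C, x_agg)
    return all(b_agg < binding_set(C, x) for x in xs)


# Example A: union-style (coordinate-wise max) aggregation under x1 + x2 <= x3.
C_a = [[1, 1, -1]]
xs_a = [(1, 0, 1), (0, 1, 1)]
x_agg_a = tuple(max(col) for col in zip(*xs_a))

# Example B: additive aggregation in the cone x1 <= 2*x2, x2 <= 2*x1.
C_b = [[1, -2], [-2, 1]]
xs_b = [(2, 1), (1, 2)]
x_agg_b = tuple(sum(col) for col in zip(*xs_b))

if __name__ == "__main__":
    print(feasibility_expansion(C_a, xs_a, x_agg_a), support_expansion(xs_a, x_agg_a))
    print(binding_set_contraction(C_b, xs_b, x_agg_b))
```

In Example A the coordinate-wise maximum leaves the cone and also gains support, so it exhibits both feasibility expansion and support expansion. In Example B each input sits on a different face of the cone and their sum lands in the interior. Because a cone is closed under addition, additive aggregation alone cannot produce feasibility expansion in this sketch, which is why Example A uses a nonlinear rule.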

Theorems & Definitions (51)

  • Definition 2.4: Elicitability-expansion
  • Definition 3.1: Feasibility Expansion
  • Example 3.2
  • Definition 3.3: Support expansion
  • Example 3.4
  • Definition 3.5: Binding set contraction
  • Example 3.6
  • Theorem 3.7
  • Definition 4.1: Feasible, budget-reducing directions
  • Definition 4.2
  • ...and 41 more