LLM Chemistry Estimation for Multi-LLM Recommendation

Huascar Sanchez, Briland Hitaj

TL;DR

This work introduces LLM Chemistry, a framework for quantifying how collaborating LLMs interact on a task, capturing both synergy and interference. It defines a cost-based ensemble objective $cost_Q(X)$ and a chemistry measure $chem_Q(a,b,S)$, and uses Model Interaction Graphs (MIGs) to compute interactions efficiently. The ChemE algorithm computes pairwise chemistry, and the Recommend method selects optimal ensembles via a hill-climbing search over precomputed subsets with a loss that balances inter- and intra-subset chemistry, achieving polynomial-time practicality. Empirical results across three benchmarks show chemistry effects are task- and size-dependent and scale with model diversity, providing a diagnostic and optimization tool for robust multi-LLM systems. Overall, the framework enables chemistry-aware ensemble formation beyond simple top-performer selection, with implications for Mixture-of-Experts and human–machine collaboration.
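The hill-climbing idea mentioned above can be sketched generically. This is a minimal illustration, not the paper's Recommend method: the neighborhood rule (candidate subsets that differ by at most one swapped model), the `TOY_LOSS` table, and all function names are assumptions made up for the example, and the real loss that balances inter- and intra-subset chemistry is not reproduced here.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List

Subset = FrozenSet[str]


def neighbors(current: Subset, candidates: List[Subset]) -> List[Subset]:
    """Candidates reachable by adding, removing, or swapping one model (assumed rule)."""
    return [s for s in candidates if s != current and len(s ^ current) <= 2]


def hill_climb(start: Subset, candidates: List[Subset],
               loss: Callable[[Subset], float]) -> Subset:
    """Greedy descent: move to the best neighbor until none strictly improves the loss."""
    current = start
    while True:
        options = neighbors(current, candidates)
        if not options:
            return current
        best = min(options, key=loss)
        if loss(best) >= loss(current):
            return current  # local minimum reached
        current = best


# Hypothetical precomputed subsets (all pairs of four models) with made-up losses.
TOY_LOSS: Dict[Subset, float] = {
    frozenset(pair): value
    for pair, value in {
        ("a", "b"): 0.9, ("a", "c"): 0.6, ("a", "d"): 0.7,
        ("b", "c"): 0.5, ("b", "d"): 0.8, ("c", "d"): 0.2,
    }.items()
}
candidates = [frozenset(pair) for pair in combinations("abcd", 2)]

result = hill_climb(frozenset({"a", "b"}), candidates, lambda s: TOY_LOSS[s])
print(sorted(result))
```

Because each accepted move strictly lowers the loss and the candidate pool is finite, the loop always terminates, although only at a local minimum of whatever loss is supplied.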

Abstract

Multi-LLM collaboration promises accurate, robust, and context-aware solutions, yet existing approaches rely on implicit selection and output assessment without analyzing whether collaborating models truly complement or conflict. We introduce LLM Chemistry -- a framework that measures when LLM combinations exhibit synergistic or antagonistic behaviors that shape collective performance beyond individual capabilities. We formalize the notion of chemistry among LLMs, propose algorithms that quantify it by analyzing interaction dependencies, and recommend optimal model ensembles accordingly. Our theoretical analysis shows that chemistry among collaborating LLMs is most evident under heterogeneous model profiles, with its outcome impact shaped by task type, group size, and complexity. Evaluation on classification, summarization, and program repair tasks provides initial evidence for these task-dependent effects, thereby reinforcing our theoretical results. This establishes LLM Chemistry as both a diagnostic factor in multi-LLM systems and a foundation for ensemble recommendation.

Paper Structure

This paper contains 21 sections, 2 theorems, 8 equations, 7 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

LLM Chemistry emerges in $S$ as a function of the MIG iff models exhibit heterogeneous performance profiles $(q_i, a_i)$; for (near-)identically performing models, cost-based selection pressure vanishes, so no interaction effects can be detected (chemistry $= 0$). (Proof in the appendix, sec:theory.)

Figures (7)

  • Figure 1: MIG for $S\!=\!\{a,b,c\}$. Underlined elements indicate $\mathit{used}(X)$ (LLMs with $a_i \geq 0.5$). Sample cost values are provided in each node.
  • Figure 2: Illustration of LLM Chemistry estimation process for multi-LLM recommendation. A snapshot of the performance histories is provided in the appendix (ssec:trial-runs).
  • Figure 3: LLM chemistry maps (marginal complementarity, $\Delta\mathrm{CI}$, trade-off parameter $\lambda=0.5$) for Statement Credibility Classification (low complexity, $N = 10$). Rows correspond to strategies (Random, Performance, Remote, Local); the Chemistry row is included for comparison. Columns show Weak, Mid, and Strong ensembles. Weak ensembles display extensive bright regions, indicating substantial chemistry potential and performance gains. Mid ensembles are mixed, with some retaining bright regions and others already saturated. Strong ensembles are almost entirely dark, reflecting saturation where added models are redundant. A few weak panels appear nearly uniform in $\Delta \mathrm{CI}$; these reflect negligible variation and are treated as saturated rather than as broad chemistry.
  • Figure 4: LLM chemistry maps (marginal complementarity, $\Delta\mathrm{CI}$, trade-off parameter $\lambda=0.5$) for Clinical Notes Summarization (medium complexity, $N = 10$). Rows correspond to strategies (Random, Performance, Remote, Local); the Chemistry row is included for comparison. Columns show Weak, Mid, and Strong ensembles. Weak ensembles are mixed: three maps display bright regions (chemistry emergence possible), while two are mostly dark with marginal potential. Mid ensembles are largely dark, with one strategy retaining notable bright regions but most showing only small gains. Strong ensembles are uniformly dark with tiny slivers, indicating near-complete saturation where added models provide little benefit.
  • Figure 5: LLM chemistry maps (marginal complementarity, $\Delta\mathrm{CI}$, trade-off parameter $\lambda=0.5$) for Automated Program Repair (high complexity, $N = 10$). Rows correspond to strategies (Random, Performance, Remote, Local); the Chemistry row is included for comparison. Columns show Weak, Mid, and Strong ensembles. Bright regions are nearly absent across all ensembles, indicating saturation ($\Delta\mathrm{CI} \approx 0$) where added models are redundant, chemistry emergence is negligible, and performance plateaus. Occasional uniform panels with vanishing $\Delta \mathrm{CI}$ variation are likewise interpreted as saturated.
  • ...and 2 more figures
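
The Figure 1 caption describes $\mathit{used}(X)$ as the LLMs in $X$ with $a_i \geq 0.5$. A minimal sketch of that filter follows; the threshold comes from the caption, while the function signature, the dictionary layout, and the sample values are assumptions invented for illustration.

```python
from typing import Dict, Iterable, Set

USED_THRESHOLD = 0.5  # from the Figure 1 caption: a_i >= 0.5


def used(subset: Iterable[str], a: Dict[str, float],
         threshold: float = USED_THRESHOLD) -> Set[str]:
    """Return the models in `subset` whose a_i meets the threshold."""
    return {model for model in subset if a[model] >= threshold}


# Made-up a_i values for S = {a, b, c}; not taken from the paper.
sample_a = {"a": 0.9, "b": 0.4, "c": 0.5}
print(sorted(used(["a", "b", "c"], sample_a)))
```

Note that the comparison is inclusive, so a model sitting exactly at 0.5 counts as used.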

Theorems & Definitions (9)

  • Definition 1: Benefit
  • Definition 2: LLM Chemistry
  • Theorem 1
  • Corollary 1
  • proof
  • proof
  • proof
  • proof
  • proof