LLM Chemistry Estimation for Multi-LLM Recommendation
Huascar Sanchez, Briland Hitaj
TL;DR
This work introduces LLM Chemistry, a framework for quantifying how collaborating LLMs interact on a task, capturing both synergy and interference. It defines a cost-based ensemble objective $cost_Q(X)$ and a chemistry measure $chem_Q(a,b,S)$, and uses Model Interaction Graphs (MIGs) to compute interactions efficiently. The ChemE algorithm computes pairwise chemistry, and the Recommend method selects optimal ensembles via a hill-climbing search over precomputed subsets, guided by a loss that balances inter- and intra-subset chemistry, which keeps the search practical in polynomial time. Empirical results across three benchmarks (classification, summarization, and program repair) show that chemistry effects depend on task and group size and scale with model diversity, making the framework both a diagnostic and an optimization tool for robust multi-LLM systems. Overall, the framework enables chemistry-aware ensemble formation beyond simple top-performer selection, with implications for Mixture-of-Experts and human–machine collaboration.
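To make the summary concrete, here is a minimal Python sketch of the two ideas it names: a pairwise chemistry score derived from an ensemble cost, and a hill-climbing search over model subsets. Everything in it is an assumption for illustration. The `COST` table is toy data, `marginal_gain`, `chem`, and `hill_climb` are hypothetical names, and the chemistry formula (the change in model a's marginal cost reduction when b is present) is a common interaction-style definition, not necessarily the paper's exact $chem_Q(a,b,S)$. The search also minimizes cost directly rather than the chemistry-balancing loss used by Recommend.

```python
# Toy ensemble costs (hypothetical error percentages; lower is better).
COST = {
    frozenset(): 100,
    frozenset("A"): 60,
    frozenset("B"): 70,
    frozenset("C"): 80,
    frozenset("AB"): 20,
    frozenset("AC"): 60,
    frozenset("BC"): 65,
    frozenset("ABC"): 30,
}


def cost(ensemble):
    """Look up the toy cost of an ensemble (any iterable of model names)."""
    return COST[frozenset(ensemble)]


def marginal_gain(a, subset):
    """Cost reduction obtained by adding model `a` to `subset`."""
    subset = frozenset(subset)
    return cost(subset) - cost(subset | {a})


def chem(a, b, subset):
    """Assumed chemistry score: how much b's presence changes a's gain.

    Positive values suggest synergy, negative values suggest interference.
    """
    subset = frozenset(subset)
    return marginal_gain(a, subset | {b}) - marginal_gain(a, subset)


def hill_climb(start, models):
    """Greedy local search: toggle one model at a time while cost improves."""
    current = frozenset(start)
    while True:
        # Each neighbor adds or removes exactly one model.
        neighbors = [current ^ {m} for m in models]
        best = min(neighbors, key=cost)
        if cost(best) >= cost(current):
            return current  # local optimum reached
        current = best


if __name__ == "__main__":
    print(chem("A", "B", ()))           # A and B help each other in this toy table
    print(chem("A", "C", ()))           # A and C interfere in this toy table
    print(sorted(hill_climb("C", "ABC")))
```

In this toy table the best pair outperforms the full three-model ensemble, which is the kind of effect a chemistry-aware selection is meant to surface. Like any hill-climbing search, the sketch can stop at a local optimum.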
Abstract
Multi-LLM collaboration promises accurate, robust, and context-aware solutions, yet existing approaches rely on implicit selection and output assessment without analyzing whether collaborating models truly complement or conflict with one another. We introduce LLM Chemistry, a framework that measures when LLM combinations exhibit synergistic or antagonistic behaviors that shape collective performance beyond individual capabilities. We formalize the notion of chemistry among LLMs, propose algorithms that quantify it by analyzing interaction dependencies, and recommend optimal model ensembles accordingly. Our theoretical analysis shows that chemistry among collaborating LLMs is most evident under heterogeneous model profiles, with its impact on outcomes shaped by task type, group size, and complexity. Evaluation on classification, summarization, and program repair tasks provides initial evidence for these task-dependent effects, supporting our theoretical results. This establishes LLM Chemistry as both a diagnostic factor in multi-LLM systems and a foundation for ensemble recommendation.
