StackingNet: Collective Inference Across Independent AI Foundation Models

Siyang Li; Chenhao Liu; Dongrui Wu; Zhigang Zeng; Lieyun Ding

TL;DR

By turning diversity from a source of inconsistency into a basis for collaboration, StackingNet establishes a practical foundation for coordinated artificial intelligence, suggesting that progress may emerge not only from larger single models but also from principled cooperation among many specialized ones.

Abstract

Artificial intelligence built on large foundation models has transformed language understanding, vision and reasoning, yet these systems remain isolated and cannot readily share their capabilities. Integrating the complementary strengths of such independent foundation models is essential for building trustworthy intelligent systems. Despite rapid progress in individual model design, there is no established approach for coordinating such black-box heterogeneous models. Here we show that coordination can be achieved through a meta-ensemble framework termed StackingNet, which draws on principles of collective intelligence to combine model predictions during inference. StackingNet improves accuracy, reduces bias, enables reliability ranking, and identifies or prunes models that degrade performance, all without access to internal parameters or training data. Across tasks involving language comprehension, visual estimation, and academic paper rating, StackingNet consistently improves accuracy, robustness, and fairness compared with individual models and classic ensembles. By turning diversity from a source of inconsistency into a basis for collaboration, StackingNet establishes a practical foundation for coordinated artificial intelligence, suggesting that progress may emerge not only from larger single models but also from principled cooperation among many specialized ones.

Paper Structure (39 sections, 8 theorems, 30 equations, 10 figures, 5 tables)

Introduction
Results and Discussion
Conclusions
Methods
Data availability statement
Code availability statement
Acknowledgements
Author contributions statement
Competing interests
Supplementary information
Extended Data
Supplementary Material

Key Result

Theorem 1

Under the i.i.d. and independence assumptions, the expected squared error of the uniform-weight combination satisfies a bound in which $y$ is the ground-truth label and the expectation is taken over $(\mathbf{x},y) \sim \mathrm{D}$, the unknown test data distribution.
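As a rough illustration of why uniform-weight averaging can reduce squared error, the sketch below simulates several hypothetical base models whose errors are independent, zero-mean Gaussian noise. It does not reproduce the bound in Theorem 1; the function name, noise model, and all parameter values are assumptions chosen only for this example.

```python
import random


def uniform_combination_errors(num_models=5, num_samples=2000,
                               noise_std=1.0, seed=0):
    """Compare average individual MSE with the MSE of the uniform average.

    Assumption (illustrative only): each hypothetical model predicts the
    true label plus independent zero-mean Gaussian noise.
    """
    rng = random.Random(seed)
    individual_sq = 0.0  # summed squared error over all models and samples
    ensemble_sq = 0.0    # summed squared error of the uniform average
    for _ in range(num_samples):
        y = rng.uniform(0.0, 10.0)  # ground-truth label
        preds = [y + rng.gauss(0.0, noise_std) for _ in range(num_models)]
        individual_sq += sum((p - y) ** 2 for p in preds)
        avg_pred = sum(preds) / num_models  # uniform-weight combination
        ensemble_sq += (avg_pred - y) ** 2
    mean_individual = individual_sq / (num_models * num_samples)
    ensemble = ensemble_sq / num_samples
    return mean_individual, ensemble


mean_individual, ensemble = uniform_combination_errors()
print(f"average individual MSE: {mean_individual:.3f}")
print(f"uniform-combination MSE: {ensemble:.3f}")
```

For any set of predictions, the squared error of the average never exceeds the average of the squared errors (a consequence of convexity), so the second number is at most the first. When the errors are also uncorrelated and zero-mean, as simulated here, the ratio is expected to approach one over the number of models.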

Figures (10)

Figure 1: Collective inference of independent intelligent systems through StackingNet. a, Relationship between collective complexity and cognitive complexity across biological and artificial systems. b, Aggregated inference from multiple independent foundation models across diverse task types. c, StackingNet architecture and its learnable parameters. Each base model is treated as an independent black-box system whose outputs are combined through a trainable meta-learner for regression or classification. d, Functional utilities of StackingNet. The framework combines model outputs, reduces individual bias, estimates reliability with or without supervision, and filters unreliable or adversarial models. Reliability scores are calculated from the publicly available benchmark LMarena (https://lmarena.ai/leaderboard), accessed on Oct. 5, 2025. All scores are shown for illustration only and do not reflect actual model performance.
Figure 1: Research paper rating error by individual human reviewers, individual LLMs, and collective inference of multiple LLMs. a-b, Mean absolute error (MAE; lower values indicate better performance) across two datasets, ICLR2025 and NeurIPS2024. StackingNet was trained in a few-shot setting using 1% of labeled examples (10 papers with ground-truth scores from multiple human reviewers) drawn either from the same year or from the previous year.
Figure 2: Research paper rating error by individual human reviewers, individual LLMs, and collective inference of multiple LLMs. a-d, Mean absolute error (MAE; lower values indicate better performance) across four datasets: ICLR2025, ICLR2024, NeurIPS2024 and NeurIPS2023. Errors for individual humans are computed relative to consensus scores obtained by aggregating multiple human reviewers. StackingNet was trained in a few-shot setting using 1% of randomly selected labeled data (10 examples with ground-truth scores). e, Performance of StackingNet with respect to the amount of labeled training data. f, Distribution of rating errors aggregated across all four datasets.
Figure 2: Facial image attribute ratings by VLMs and StackingNet combination on the Chicago Face Database. Predicted ratings are shown across gender and racial groupings.
Figure 3: Facial attribute ratings by VLMs and StackingNet combination on the Chicago Face Database. a, Heatmaps of MAE of base models across thirteen attributes, stratified by two gender groups and six racial groups. b, Distribution of signed prediction errors for the base models and StackingNet across thirteen attributes in the normalized label space. Predictions were clipped, and the label space was min-max normalized to $[0,1]$. Gender and racial group names were anonymized with numeric identifiers. Groupings are shown only to audit model behavior and do not imply any inherent traits or conclusions about individuals or populations.
...and 5 more figures

Theorems & Definitions (8)

Theorem 1: Error bound for uniform-weight regression combination (Breiman1996)
Lemma 1: Simplified error reduction factor for uniform-weight regression combination (Zhou2012)
Theorem 2: Optimal weights for weighted regression combination (Perrone1995)
Lemma 2: Closed-form optimal weights and error for weighted regression combination (Perrone1995)
Theorem 3: Monotonicity and convergence of voting with increasing number of classifiers (Lam1997)
Theorem 4: Optimal weights for weighted classification combination (Shapley1984; Lam1997)
Lemma 3: Covariance between classifiers (Parisi2014)
Theorem 5: Spectral decomposition for classifier reliability estimation (Parisi2014)
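
Two of the classical ideas named in this list can be sketched in a few lines. The code below is not taken from the paper and does not restate its theorems. It assumes unbiased regressors with uncorrelated errors, for which variance-minimizing weights that sum to one are proportional to inverse error variances, and independent classifiers with a shared accuracy `p`, for which majority-vote accuracy follows a binomial tail. Function names and inputs are hypothetical.

```python
from math import comb


def inverse_variance_weights(error_variances):
    """Weights for combining unbiased regressors with uncorrelated errors.

    Assumption: errors are zero-mean and mutually uncorrelated, so the
    variance-minimizing weights (summing to one) are proportional to the
    inverse of each model's error variance.
    """
    inverse = [1.0 / v for v in error_variances]
    total = sum(inverse)
    weights = [x / total for x in inverse]
    combined_variance = 1.0 / total  # error variance of the weighted average
    return weights, combined_variance


def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent classifiers is correct.

    Assumption: every classifier is correct with the same probability p,
    independently of the others; n must be odd to avoid ties.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    needed = n // 2 + 1  # smallest number of correct votes forming a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(needed, n + 1))


weights, variance = inverse_variance_weights([1.0, 4.0])
print(weights, variance)
for n in (1, 3, 5, 7):
    print(n, majority_vote_accuracy(0.6, n))
```

With `p` above 0.5, the printed majority-vote accuracies grow as `n` increases, which is the behavior the monotonicity result for voting describes under these simplified assumptions. When errors are correlated, the inverse-variance rule no longer applies directly and the full error covariance has to be taken into account.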
