Fortytwo: Swarm Inference with Peer-Ranked Consensus

Vladyslav Larin, Ihor Naumenko, Aleksei Ivashov, Ivan Nikitin, Alexander Firsov

TL;DR

Fortytwo addresses the scalability bottleneck of centralized AI by distributing inference across a swarm of heterogeneous nodes that jointly generate and rank outputs. The core method combines distributed pairwise ranking, aggregated with a Bradley-Terry-style model, with multi-token reasoning, reputation-weighted consensus, proof-of-capability and staking defenses against Sybil attacks, and on-chain coordination. Across six benchmarks, swarm inference substantially outperforms majority voting over the same model set (e.g., GPQA Diamond rises from 68.69% to 85.90%) and is far more robust to adversarial and noisy prompts (0.12% degradation under prompt injection versus 6.20% for a monolithic single-model baseline). The results indicate that decentralized, meritocratic swarms can deliver high-quality, secure AI inference while democratizing access and remaining practical to deploy, offering a blueprint for collective intelligence in AI systems.

Abstract

As centralized AI hits compute ceilings and diminishing returns from ever-larger training runs, meeting demand requires an inference layer that scales horizontally in both capacity and capability. We present Fortytwo, a novel protocol that leverages swarm intelligence principles and distributed pairwise ranking consensus to achieve superior performance in AI inference. Our approach reimagines collaboration among AI nodes using swarm inference: a peer-ranked, reputation-weighted consensus across heterogeneous models that surfaces the highest-quality responses. Using pairwise ranking with a custom Bradley-Terry-style aggregation model, we demonstrate that swarm inference substantially outperforms majority voting, achieving 85.90% on GPQA Diamond versus 68.69% for majority voting with the same model set, an improvement of 17.21 percentage points (approximately 25.1% relative). The protocol incorporates on-chain reputation so node influence adapts to demonstrated accuracy over time, yielding a meritocratic consensus that filters low-quality or malicious participants. To resist Sybil attacks, Fortytwo employs proof-of-capability in its consensus: nodes must successfully complete calibration/test requests and stake reputation to enter ranking rounds, making multi-identity attacks economically unattractive while preserving openness. Across six challenging benchmarks, including GPQA Diamond, LiveCodeBench, and AIME, our evaluation indicates higher accuracy and strong resilience to adversarial and noisy free-form prompting (e.g., prompt-injection degradation of only 0.12% versus 6.20% for a monolithic single-model baseline), while retaining practical deployability. Together, these results establish a foundation for decentralized AI systems that democratize access to high-quality inference through collective intelligence without sacrificing reliability or security.
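The reputation-weighted pairwise ranking described above can be illustrated with a minimal sketch. The abstract does not specify the paper's custom aggregation model, so this example substitutes the standard Bradley-Terry model fitted by minorization-maximization (MM) updates. The function name `bradley_terry`, the vote format `(winner, loser, weight)`, the use of a ranking node's reputation as the vote weight, and the smoothing `prior` are all assumptions made for illustration.

```python
def bradley_terry(votes, n_items, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths from weighted pairwise votes.

    votes: iterable of (winner, loser, weight) tuples, where weight is a
           hypothetical reputation score of the node that cast the vote.
    Returns a list of strengths that sums to 1 (higher means better).
    """
    # wins[i][j] accumulates the total weight of votes preferring i over j.
    wins = [[0.0] * n_items for _ in range(n_items)]
    for winner, loser, weight in votes:
        wins[winner][loser] += weight

    # Symmetric pseudo-count (an assumption) keeps every strength positive
    # even when a response never wins a comparison.
    for i in range(n_items):
        for j in range(n_items):
            if i != j:
                wins[i][j] += prior

    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n_items)
                if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p


# Three candidate responses ranked by nodes with different reputations.
example_votes = [
    (0, 1, 1.0),  # a high-reputation node prefers response 0 over 1
    (0, 2, 0.9),
    (1, 2, 0.5),
    (2, 1, 0.2),  # a low-reputation node disagrees on 1 vs 2
]
strengths = bradley_terry(example_votes, n_items=3)
best = max(range(3), key=lambda i: strengths[i])
print("selected response:", best)
```

Under this weighting, one vote from a high-reputation node can outweigh several votes from low-reputation nodes, whereas plain majority voting counts every vote equally.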

Paper Structure

This paper contains 138 sections, 23 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Benchmark performance comparison across models.
  • Figure 2: Modular architecture of self-supervised inference, showing the four key components and their interactions
  • Figure 3: Evolution of node reputation scores over time showing different performance patterns
  • Figure 4: Benchmark performance comparison across models. Fortytwo achieves state-of-the-art results on LiveCodeBench (84.4%), MATH-500 (99.6%), AIME 2024 (100%), and AIME 2025 (96.66%), demonstrating superior performance on challenging reasoning and coding benchmarks.
  • Figure 5: Performance scaling with swarm size showing rapid improvement and convergence
  • ...and 2 more figures