Design and Evaluation of Cost-Aware PoQ for Decentralized LLM Inference

Arther Tian, Alex Ding, Frank Chen, Alan Wu, Aaron Chan, Bruce Zhang

TL;DR

This paper extends the Proof of Quality (PoQ) framework for decentralized LLM inference by incorporating explicit cost considerations into the reward mechanism for both inference and evaluator nodes. It blends ground-truth token-level F1, lightweight evaluators, and GPT-based judgments within a linear reward scheme to balance quality and efficiency. Empirical results show that STS-based bi-encoders align best with objective and subjective quality signals, and that larger yet efficient models can outperform smaller counterparts when cost is accounted for. Monte Carlo simulations demonstrate improved incentive alignment toward high-quality, low-cost inferences and efficient evaluators, suggesting a practical path to economically sustainable decentralized LLM inference. The work also provides deployment guidelines and highlights limitations and future directions for more heterogeneous and adversarial environments.

Abstract

Decentralized large language model (LLM) inference promises transparent and censorship-resistant access to advanced AI, yet existing verification approaches struggle to scale to modern models. Proof of Quality (PoQ) replaces cryptographic verification of computation with consensus over output quality, but the original formulation ignores heterogeneous computational costs across inference and evaluator nodes. This paper introduces a cost-aware PoQ framework that integrates explicit efficiency measurements into the reward mechanism for both types of nodes. The design combines ground-truth token-level F1, lightweight learned evaluators, and GPT-based judgments within a unified evaluation pipeline, and adopts a linear reward function that balances normalized quality and cost. Experiments on extractive question answering and abstractive summarization use five instruction-tuned LLMs ranging from TinyLlama-1.1B to Llama-3.2-3B and three evaluation models spanning cross-encoder and bi-encoder architectures. Results show that a semantic textual similarity (STS) bi-encoder achieves much higher correlation with both ground-truth and GPT scores than cross-encoders, indicating that evaluator architecture is a critical design choice for PoQ. Quality-cost analysis further reveals that the largest models in the pool are also the most efficient in terms of quality per unit latency. Monte Carlo simulations over 5,000 PoQ rounds demonstrate that the cost-aware reward scheme consistently assigns higher average rewards to high-quality, low-cost inference models and to efficient evaluators, while penalizing slow, low-quality nodes. These findings suggest that cost-aware PoQ provides a practical foundation for economically sustainable decentralized LLM inference.
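As a rough illustration of the two ingredients named in the abstract, the sketch below computes a token-level F1 score and a linear reward that trades normalized quality against normalized cost. It is a minimal sketch under stated assumptions, not the paper's implementation: the whitespace tokenization, the min-max normalization, the use of latency as the cost signal, and the weights `alpha` and `beta` are all hypothetical choices made here for illustration.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings (assumes lowercased whitespace tokens)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def cost_aware_rewards(qualities, latencies, alpha=1.0, beta=0.5):
    """Linear reward per node: alpha * normalized quality - beta * normalized cost.

    alpha and beta are hypothetical weights, not values taken from the paper.
    """
    q = min_max(qualities)
    c = min_max(latencies)
    return [alpha * qi - beta * ci for qi, ci in zip(q, c)]
```

With weights like these, a node is rewarded for quality relative to its peers and charged for latency relative to its peers, so a slow node has to deliver noticeably better output to come out ahead.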

Paper Structure

This paper contains 23 sections, 10 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparison of inference verification paradigms in blockchain environments. (a) Proof of Quality (PoQ) employs multiple lightweight evaluators to assess output quality with minimal overhead. (b) OPML requires expensive VM validation taking minutes to hours. (c) ZKML demands intensive computation for proof generation, often requiring hours for completion. (d) Vanilla inference lacks any verification mechanism, making it unsuitable for trustless environments.
  • Figure 2: Conceptual illustration of the quality-cost trade-off in LLM inference. (a) Output quality increases with model size but exhibits diminishing returns. (b) Computational cost grows disproportionately with model scale. (c) The quality-to-cost efficiency ratio decreases as models become larger, highlighting the need for cost-aware incentive mechanisms. Traditional PoQ rewards based solely on quality would favor giant models despite their poor efficiency.
  • Figure 3: Overall architecture of the cost-aware PoQ pipeline. Inputs from SQuAD and CNN/DailyMail are processed by a set of inference nodes denoted as F nodes, evaluated by lightweight models denoted as M nodes, and aggregated in a consensus and rewards layer that computes quality scores and cost-aware incentives.
  • Figure 4: Correlation between evaluator scores and reference signals. Bars show Pearson correlation with ground-truth F1 scores and GPT-based judgments, averaged over question answering and summarization tasks.
  • Figure 5: Quality-cost trade-offs for inference models. Left: average ground-truth F1 score versus latency. Right: average GPT-based score versus latency. Each point corresponds to a single inference model.
  • ...and 1 more figure
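The evaluator comparison summarized in the Figure 4 caption rests on Pearson correlation between an evaluator's scores and a reference signal, averaged over tasks. The sketch below shows one way that statistic could be computed. The function names and the equal-weight average across tasks are assumptions made for illustration, and any numbers passed in would be the caller's own data, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mean_correlation(per_task_scores):
    """Average the per-task correlations with equal weight (an assumption).

    per_task_scores maps a task name to a pair
    (evaluator_scores, reference_scores).
    """
    correlations = [pearson(ev, ref) for ev, ref in per_task_scores.values()]
    return sum(correlations) / len(correlations)
```

Computing the correlation per task before averaging keeps a task with a wider score range from dominating the comparison, which is one plausible reading of "averaged over question answering and summarization tasks".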