
A Multi-Dimensional Quality Scoring Framework for Decentralized LLM Inference with Proof of Quality

Arther Tian, Alex Ding, Frank Chen, Simon Wu, Aaron Chan

TL;DR

The paper proposes a multi-dimensional quality scoring framework that decomposes output quality into modular dimensions: model and cost priors, structure quality, semantic quality, query-output alignment, and agreement/uncertainty.

Abstract

Decentralized large language model (LLM) inference networks can pool heterogeneous compute to scale serving, but they require lightweight and incentive-compatible mechanisms to assess output quality. Prior work introduced cost-aware Proof of Quality (PoQ) and adaptive robust PoQ to allocate rewards under evaluator heterogeneity and adversarial behavior. In this paper, we focus on the quality signal itself and propose a multi-dimensional quality scoring framework that decomposes output quality into modular dimensions, including model and cost priors, structure quality, semantic quality, query-output alignment, and agreement/uncertainty. Using logged outputs from QA and summarization tasks, we systematically audit dimension reliability and show that seemingly reasonable dimensions can be task-dependent and even negatively correlated with reference quality without calibration. While the default composite underperforms a strong single semantic evaluator, ablations reveal that removing unreliable dimensions and re-normalizing weights yields a calibrated composite that matches or exceeds the best single-evaluator and consensus baselines. Finally, we integrate the composite score as a drop-in quality signal in PoQ and demonstrate complementary benefits with robust aggregation and adaptive trust weighting under adversarial evaluator attacks.
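
The calibration step described in the abstract (dropping unreliable dimensions and re-normalizing the remaining weights) can be illustrated with a minimal sketch. It assumes the composite is a weighted average of dimension scores already normalized to [0, 1]. The function name `composite_score`, the dimension keys, and all weights and scores below are hypothetical values chosen for illustration, not figures from the paper.

```python
from typing import Dict, Iterable


def composite_score(
    dim_scores: Dict[str, float],
    weights: Dict[str, float],
    drop: Iterable[str] = (),
) -> float:
    """Weighted average of dimension scores after dropping some dimensions.

    Weights of the remaining dimensions are re-normalized to sum to 1.
    Assumes every score is already normalized to [0, 1].
    """
    dropped = set(drop)
    kept = {
        d: w for d, w in weights.items()
        if d not in dropped and d in dim_scores
    }
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no positive weight left after dropping dimensions")
    return sum((w / total) * dim_scores[d] for d, w in kept.items())


# Hypothetical weights and scores for one (query, output) pair.
weights = {
    "priors": 0.1,
    "structure": 0.2,
    "semantic": 0.4,
    "alignment": 0.2,
    "agreement": 0.1,
}
scores = {
    "priors": 0.5,
    "structure": 0.8,
    "semantic": 0.9,
    "alignment": 0.2,
    "agreement": 0.4,
}

default = composite_score(scores, weights)
calibrated = composite_score(scores, weights, drop=["alignment", "agreement"])
```

In this toy case the two dropped dimensions carry low scores, so removing them and re-normalizing raises the composite. Whether a dimension should be dropped at all depends on its measured reliability for the task, which the paper audits against reference quality.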

Paper Structure (71 sections, 14 figures, 9 tables)

Figures (14)

  • Figure 1: Overview of the proposed multi-dimensional quality scoring framework and its integration into Proof of Quality (PoQ) for decentralized LLM inference. Candidate outputs are scored by multiple dimension modules and combined into a composite quality signal that can be used for consensus and rewards.
  • Figure 2: Modular architecture of multi-dimensional quality scoring. Each dimension module produces a normalized score; the composite score $\hat{s}(q,y)$ is then used as a PoQ-compatible quality signal for aggregation and incentives.
  • Figure 3: Unified correlation summary across individual evaluators, consensus methods, and the default composite score.
  • Figure 4: Correlation heatmap (GT) across evaluators, consensus baselines, composite, and dimensions.
  • Figure 5: Per-dimension correlation with GT. Semantic quality is strongly aligned overall, while alignment and agreement dimensions can be negatively correlated without calibration.
  • ...and 9 more figures
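
The per-dimension audit summarized in the Figure 5 caption can be sketched as a correlation check between each dimension's scores and the reference (GT) quality. The sketch below assumes Pearson correlation and a threshold of zero for flagging a dimension as unreliable. The correlation measure, the threshold, the function names, and the sample data are all illustrative assumptions rather than the paper's procedure.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def audit_dimensions(
    dim_scores: Dict[str, Sequence[float]],
    gt: Sequence[float],
    min_corr: float = 0.0,
) -> Tuple[Dict[str, float], List[str]]:
    """Correlate each dimension with GT and flag those at or below min_corr."""
    corrs = {name: pearson(vals, gt) for name, vals in dim_scores.items()}
    flagged = sorted(name for name, r in corrs.items() if r <= min_corr)
    return corrs, flagged


# Hypothetical logged scores for four outputs.
gt = [0.1, 0.4, 0.6, 0.9]
dims = {
    "semantic": [0.2, 0.5, 0.6, 0.8],   # tracks GT
    "alignment": [0.9, 0.6, 0.5, 0.1],  # moves against GT in this toy data
}
corrs, flagged = audit_dimensions(dims, gt)
```

Dimensions that end up in `flagged` are candidates for removal or recalibration before the composite is formed. Since the abstract notes that reliability is task-dependent, an audit like this would be run separately for each task.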