Logical Consistency Between Disagreeing Experts and Its Role in AI Safety

Andrés Corrada-Emmanuel

TL;DR

The paper tackles the problem of evaluating expert judgments without ground truth by formulating a logic of unsupervised evaluation based on logical consistency. It introduces the $Q$-complex (QC) and a set of linear axioms for a single classifier, derives the feasible space of evaluations, and extends the construction to multi-class and multi-expert scenarios to produce no-knowledge alarms. Through MT-Bench experiments, it shows that disagreement patterns constrain the possible evaluations and can trigger safety alarms at practical thresholds (e.g., around 46% accuracy) even without ground-truth keys. The approach offers a domain-agnostic safety monitor for LLMs-as-Judges and other classifiers in zero-ground-truth settings, while acknowledging its inherent limits: it cannot detect when agreeing experts are jointly correct or incorrect, and it cannot establish that the test itself is valid. Overall, the work reframes disagreement as a quantitative signal to bound and monitor classifier performance in unsupervised contexts.

Abstract

If two experts disagree on a test, we may conclude both cannot be 100 per cent correct. But if they completely agree, no possible evaluation can be excluded. This asymmetry in the utility of agreements versus disagreements is explored here by formalizing a logic of unsupervised evaluation for classifiers. Its core problem is computing the set of group evaluations that are logically consistent with how we observe the classifiers agreeing and disagreeing in their decisions. Statistical summaries of their aligned decisions are inputs into a Linear Programming problem in the integer space of possible correct or incorrect responses given true labels. Obvious logical constraints, such as the requirement that the number of correct responses cannot exceed the number of observed responses, are inequalities. But in addition, there are axioms: universally applicable linear equalities that apply to all finite tests. The practical and immediate utility of this approach to unsupervised evaluation using only logical consistency is demonstrated by building no-knowledge alarms that can detect when one or more LLMs-as-Judges are violating a minimum grading threshold specified by the user.
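The asymmetry described in the abstract can be illustrated with a deliberately simplified sketch. It assumes a binary test with labels 'a' and 'b', uses only each classifier's count of 'a' responses, and measures overall accuracy; the alarms in the paper use richer agreement statistics, so this is an illustration of the idea and not the paper's algorithm. The function names `max_correct` and `alarm` are hypothetical.

```python
def max_correct(Q, Q_a, R_a):
    """Upper bound on correct answers for one classifier.

    Q   : number of test items
    Q_a : assumed number of items whose true label is 'a'
    R_a : number of 'a' responses the classifier gave
    Every unit of mismatch between R_a and Q_a forces at least one error.
    """
    return Q - abs(Q_a - R_a)


def alarm(Q, a_counts, min_correct):
    """Return True when no answer key lets every classifier reach min_correct.

    a_counts    : list with each classifier's number of 'a' responses
    min_correct : user-specified minimum number of correct answers
    No ground truth is used: we only ask whether some Q_a exists under
    which all classifiers could simultaneously meet the threshold.
    """
    for Q_a in range(Q + 1):
        if all(max_correct(Q, Q_a, R_a) >= min_correct for R_a in a_counts):
            return False  # a consistent "everyone passes" world exists
    return True  # every possible world has at least one failing classifier


# Two classifiers that answered 'a' 4 and 7 times on a 10-item test.
both_perfect_impossible = alarm(10, [4, 7], 10)
both_at_eight_possible = not alarm(10, [4, 7], 8)
agreement_excludes_nothing = not alarm(10, [4, 4], 10)
```

With disagreeing counts (4 versus 7) the alarm fires for a threshold of 10 correct answers, while identical counts never trigger it, which mirrors the first two sentences of the abstract.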

Paper Structure

This paper contains 14 sections, 18 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: All possible evaluations for a binary classifier labeling $Q=10$ items. $R_{a_i,a}$ and $R_{b_i,b}$ are the numbers of correct responses by classifier $i$ for labels 'a' and 'b' respectively.
  • Figure 2: All possible evaluations for a $Q=10$ test but now in the space of prevalence, $P_a=Q_a/Q$, and label accuracies, $P_{a_i,a}=R_{a_i,a}/Q_a$ and $P_{b_i,b}=R_{b_i,b}/(Q-Q_a).$ This is visual proof that the geometry of possible evaluations is easier in the integer response space shown in Fig. 1.
  • Figure 3: All possible evaluations for a $Q=10$ test after we observe the test summary $(R_{a_i}=4, R_{b_i}=6).$ The inequalities change the number of possible evaluations but not the dimension of their geometry.
  • Figure 4: All possible evaluations for a $Q=10$ binary test after we observe the test summary $(R_{a_i}=4, R_{b_i}=6)$ and pick evaluations consistent with it as expressed by either of the binary axioms: $R_{a_i,a}-Q_a + (R_{b_i}=6) - R_{b_i,b} = 0,$ or $R_{b_i,b}-Q_b + (R_{a_i}=4) - R_{a_i,a} = 0.$
  • Figure 5: All possible evaluations for a $Q=10$ binary test assuming the answer key summary is $(Q_a=6, Q_b=4)$ and we have observed the test summaries $(R_{a_i}=4, R_{b_i}=6)$ and $(R_{a_j}=7, R_{b_j}=3)$ for classifiers $i$ and $j.$
  • ...and 3 more figures
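
The sets described in the captions of Figures 1, 3, and 4 can be enumerated by brute force. The sketch below is a minimal illustration, assuming an evaluation is the integer triple $(Q_a, R_{a_i,a}, R_{b_i,b})$ and that the axiom is the equality obtained by setting the first expression in the Figure 4 caption to zero. The function names are hypothetical and the code is not taken from the paper.

```python
def all_evaluations(Q):
    """Every (Q_a, R_aa, R_bb) triple possible before seeing any responses.

    Q_a  : number of items whose true label is 'a'
    R_aa : classifier's correct 'a' responses (at most Q_a)
    R_bb : classifier's correct 'b' responses (at most Q - Q_a)
    """
    return [
        (Q_a, R_aa, R_bb)
        for Q_a in range(Q + 1)
        for R_aa in range(Q_a + 1)
        for R_bb in range(Q - Q_a + 1)
    ]


def satisfies_inequalities(ev, R_a, R_b):
    """Correct responses cannot exceed the observed responses per label."""
    _, R_aa, R_bb = ev
    return R_aa <= R_a and R_bb <= R_b


def satisfies_axiom(ev, R_a, R_b):
    """Binary axiom: R_aa - Q_a + R_b - R_bb == 0.

    It states that the observed 'b' responses are the correct 'b' responses
    plus the 'a' items that were answered incorrectly.
    """
    Q_a, R_aa, R_bb = ev
    return R_aa - Q_a + R_b - R_bb == 0


Q, R_a, R_b = 10, 4, 6  # test size and observed summary from the captions
everything = all_evaluations(Q)
bounded = [ev for ev in everything if satisfies_inequalities(ev, R_a, R_b)]
consistent = [ev for ev in everything if satisfies_axiom(ev, R_a, R_b)]

print(len(everything), len(bounded), len(consistent))
```

Under these assumptions the three sets contain 286, 210, and 35 points for $Q=10$. The inequalities only thin out the three-dimensional set, whereas the axiom confines the evaluations to a plane, and every point satisfying the axiom also satisfies the inequalities.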