Table of Contents
Fetching ...

A logical alarm for misaligned binary classifiers

Andrés Corrada-Emmanuel, Ilya Parker, Ramesh Bharadwaj

TL;DR

This work addresses unsupervised evaluation of binary classifier ensembles by deriving a complete algebraic axiomatization for evaluation outcomes. It introduces a polynomial-generating framework and proves a single-classifier axiom for $N=1$ and a pair-axiom for $N=2$, enabling a geometrical, label-space representation of logically consistent evaluations. By enforcing a fixed safety specification and scanning all possible $Q_a$ values, it constructs a purely logical misalignment alarm that detects when at least one member is misbehaving without ground-truth labels. The approach connects formal verification techniques to AI safety, provides exact, parameter-free insights under error-independence, and discusses practical use, limitations, and societal implications for trustable supervision of noisy AI systems.

Abstract

If two agents disagree in their decisions, we may suspect they are not both correct. This intuition is formalized for evaluating agents that have carried out a binary classification task. Their agreements and disagreements on a joint test allow us to establish the only group evaluations logically consistent with their responses. This is done by establishing a set of axioms (algebraic relations) that must be universally obeyed by all evaluations of binary responders. A complete set of such axioms are possible for each ensemble of size N. The axioms for $N = 1, 2$ are used to construct a fully logical alarm - one that can prove that at least one ensemble member is malfunctioning using only unlabeled data. The similarities of this approach to formal software verification and its utility for recent agendas of safe guaranteed AI are discussed.

A logical alarm for misaligned binary classifiers

TL;DR

This work addresses unsupervised evaluation of binary classifier ensembles by deriving a complete algebraic axiomatization for evaluation outcomes. It introduces a polynomial-generating framework and proves a single-classifier axiom for and a pair-axiom for , enabling a geometrical, label-space representation of logically consistent evaluations. By enforcing a fixed safety specification and scanning all possible values, it constructs a purely logical misalignment alarm that detects when at least one member is misbehaving without ground-truth labels. The approach connects formal verification techniques to AI safety, provides exact, parameter-free insights under error-independence, and discusses practical use, limitations, and societal implications for trustable supervision of noisy AI systems.

Abstract

If two agents disagree in their decisions, we may suspect they are not both correct. This intuition is formalized for evaluating agents that have carried out a binary classification task. Their agreements and disagreements on a joint test allow us to establish the only group evaluations logically consistent with their responses. This is done by establishing a set of axioms (algebraic relations) that must be universally obeyed by all evaluations of binary responders. A complete set of such axioms are possible for each ensemble of size N. The axioms for are used to construct a fully logical alarm - one that can prove that at least one ensemble member is malfunctioning using only unlabeled data. The similarities of this approach to formal software verification and its utility for recent agendas of safe guaranteed AI are discussed.
Paper Structure (23 sections, 34 equations, 5 figures, 3 tables, 1 algorithm)

This paper contains 23 sections, 34 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: The set of all possible evaluations in ($R_{a_i,a}$,$R_{b_i,b}$, $Q_a$) space for a $Q=281$ test for three LLMs grading a fourth one on the multistep-arithmetic task in the BIG-Bench-Mistake Dataset. Once we observe how the test was answered, we can use the single classifier axiom: $(Q R_{a_i,a}\xspace - Q_a\xspace R_{a_i}) - (Q R_{b_i,b}\xspace - Q_b\xspace R_{b_i}).$ It defines a much smaller set of evaluations consistent with the observed test responses. Somewhere in this set is the unknown ground truth value for the number of correct responses in each label, $R_{a_i,a}$ and $R_{b_i,b}$. Once we know the number of responses for a classifier, $R_{a_i}$ and $R_{b_i}$, the axiom defines a plane as pictured here for the three grading LLMs.
  • Figure 2: All the possible evaluations for a pair of binary classifiers for a $Q=100$ test assuming $Q_a\xspace=60.$ The pair is working correctly for a label when its group evaluation lies on the green circles. Conversely, it is malfunctioning if it lies on the red circles. The computation of the possible evaluations was done for the hypothetical case of $R_{a_1}=60$ and $R_{a_2}=20.$ For such a test result, there are no group evaluations for label $\mathcal{A}$ that satisfy the safety specification of being better than 50% on a label.
  • Figure 3: All the possible single classifier test summaries for a $Q=20$ test. The left figure shows them in $\mathcal{R}\text{-space}$, the right one in $\mathcal{P}\text{-space}$. Note the difference in the geometry of each set.
  • Figure 4: Logical traces for the misalignment alarm based on the single classifier axiom for binary classifiers. Three LLMs graded a fourth one completing the multistep-arithmetic task in the BIG-Bench-Mistake dataset. One pair, the Mistral-Large and GPT4-Turbo LLMs, disagreed enough to violate the safety specification at all assumed values $Q_a$.
  • Figure 5: Logical trace for the misalignment alarm when the evaluation sketch in Table \ref{['tbl:grading-llms']} is transformed to make all the LLMs satisfy the safety specification. All three are being compared simultaneously at any fixed $Q_a$ so the single classifier axiom defines a cuboid in the 3-dimensional space for each label.