A logical alarm for misaligned binary classifiers

Andrés Corrada-Emmanuel; Ilya Parker; Ramesh Bharadwaj

A logical alarm for misaligned binary classifiers

Andrés Corrada-Emmanuel, Ilya Parker, Ramesh Bharadwaj

TL;DR

This work addresses unsupervised evaluation of binary classifier ensembles by deriving a complete algebraic axiomatization for evaluation outcomes. It introduces a polynomial-generating framework and proves a single-classifier axiom for $N=1$ and a pair-axiom for $N=2$, enabling a geometrical, label-space representation of logically consistent evaluations. By enforcing a fixed safety specification and scanning all possible $Q_a$ values, it constructs a purely logical misalignment alarm that detects when at least one member is misbehaving without ground-truth labels. The approach connects formal verification techniques to AI safety, provides exact, parameter-free insights under error-independence, and discusses practical use, limitations, and societal implications for trustable supervision of noisy AI systems.

Abstract

If two agents disagree in their decisions, we may suspect they are not both correct. This intuition is formalized for evaluating agents that have carried out a binary classification task. Their agreements and disagreements on a joint test allow us to establish the only group evaluations logically consistent with their responses. This is done by establishing a set of axioms (algebraic relations) that must be universally obeyed by all evaluations of binary responders. A complete set of such axioms are possible for each ensemble of size N. The axioms for $N = 1, 2$ are used to construct a fully logical alarm - one that can prove that at least one ensemble member is malfunctioning using only unlabeled data. The similarities of this approach to formal software verification and its utility for recent agendas of safe guaranteed AI are discussed.

A logical alarm for misaligned binary classifiers

TL;DR

and a pair-axiom for

, enabling a geometrical, label-space representation of logically consistent evaluations. By enforcing a fixed safety specification and scanning all possible

values, it constructs a purely logical misalignment alarm that detects when at least one member is misbehaving without ground-truth labels. The approach connects formal verification techniques to AI safety, provides exact, parameter-free insights under error-independence, and discusses practical use, limitations, and societal implications for trustable supervision of noisy AI systems.

Abstract

are used to construct a fully logical alarm - one that can prove that at least one ensemble member is malfunctioning using only unlabeled data. The similarities of this approach to formal software verification and its utility for recent agendas of safe guaranteed AI are discussed.

Paper Structure (23 sections, 34 equations, 5 figures, 3 tables, 1 algorithm)

This paper contains 23 sections, 34 equations, 5 figures, 3 tables, 1 algorithm.

Introduction
A verification formalism for binary evaluations
An evaluation model for the trivial ensemble, $N=1$
The N=2 construction
Verifying at least one classifier is malfunctioning
The rectangle of logically consistent label evaluations at fixed $Q_a$
Testing at all possible values of $Q_a$
Discussion and previous work
Limitations and societal impacts
Appendix / supplemental material
Definitions and the set of all possible evaluations for tests of size $Q$
$\mathcal{R}\text{-space}$ definitions
$\mathcal{P}\text{-space}$ definitions
The set of all possible single binary classifier evaluations of size $Q$
The evaluation ideals in $\mathcal{R}\text{-space}$ and $\mathcal{P}\text{-space}$
...and 8 more sections

Figures (5)

Figure 1: The set of all possible evaluations in ($R_{a_i,a}$,$R_{b_i,b}$, $Q_a$) space for a $Q=281$ test for three LLMs grading a fourth one on the multistep-arithmetic task in the BIG-Bench-Mistake Dataset. Once we observe how the test was answered, we can use the single classifier axiom: $(Q R_{a_i,a}\xspace - Q_a\xspace R_{a_i}) - (Q R_{b_i,b}\xspace - Q_b\xspace R_{b_i}).$ It defines a much smaller set of evaluations consistent with the observed test responses. Somewhere in this set is the unknown ground truth value for the number of correct responses in each label, $R_{a_i,a}$ and $R_{b_i,b}$. Once we know the number of responses for a classifier, $R_{a_i}$ and $R_{b_i}$, the axiom defines a plane as pictured here for the three grading LLMs.
Figure 2: All the possible evaluations for a pair of binary classifiers for a $Q=100$ test assuming $Q_a\xspace=60.$ The pair is working correctly for a label when its group evaluation lies on the green circles. Conversely, it is malfunctioning if it lies on the red circles. The computation of the possible evaluations was done for the hypothetical case of $R_{a_1}=60$ and $R_{a_2}=20.$ For such a test result, there are no group evaluations for label $\mathcal{A}$ that satisfy the safety specification of being better than 50% on a label.
Figure 3: All the possible single classifier test summaries for a $Q=20$ test. The left figure shows them in $\mathcal{R}\text{-space}$, the right one in $\mathcal{P}\text{-space}$. Note the difference in the geometry of each set.
Figure 4: Logical traces for the misalignment alarm based on the single classifier axiom for binary classifiers. Three LLMs graded a fourth one completing the multistep-arithmetic task in the BIG-Bench-Mistake dataset. One pair, the Mistral-Large and GPT4-Turbo LLMs, disagreed enough to violate the safety specification at all assumed values $Q_a$.
Figure 5: Logical trace for the misalignment alarm when the evaluation sketch in Table \ref{['tbl:grading-llms']} is transformed to make all the LLMs satisfy the safety specification. All three are being compared simultaneously at any fixed $Q_a$ so the single classifier axiom defines a cuboid in the 3-dimensional space for each label.

A logical alarm for misaligned binary classifiers

TL;DR

Abstract

A logical alarm for misaligned binary classifiers

Authors

TL;DR

Abstract

Table of Contents

Figures (5)