Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations

Jasmina Gajcin, Erik Miehling, Rahul Nair, Elizabeth Daly, Radu Marinescu, Seshu Tirupathi

TL;DR

The paper tackles the opacity of LLM-as-a-Judge decisions by introducing CLoVE, which generates verifiable, concept-based, contrastive local explanations, and GloVE, which distills them into a faithful global policy. GloVE represents local explanations as a $K$-partite graph, then iteratively clusters, labels, and verifies concepts with the FactReasoner algorithm to produce concise, rule-based global explanations. Across seven harm-detection datasets, the extracted policies show high fidelity to the original LLM decisions and remain robust under paraphrasing and basic adversarial attacks, and a user study indicates modest gains in perceived usefulness. The approach makes the policies behind LLM-based judgments easier to inspect, which matters for safer, more accountable deployments.

Abstract

LLM-as-a-judge, the use of LLMs to evaluate text, is increasingly deployed at scale to augment or even replace human annotations. It is therefore imperative to understand its potential biases and risks. In this work, we propose an approach for extracting high-level, concept-based global policies from an LLM-as-a-Judge. Our approach consists of two algorithms: 1) CLoVE (Contrastive Local Verifiable Explanations), which generates verifiable, concept-based, contrastive local explanations, and 2) GloVE (Global Verifiable Explanations), which uses iterative clustering, summarization, and verification to condense local rules into a global policy. We evaluate GloVE on seven standard benchmarking datasets for content harm detection and find that the extracted global policies are highly faithful to the decisions of the LLM-as-a-Judge. Additionally, we evaluate the robustness of global policies to text perturbations and adversarial attacks. Finally, we conduct a user study to evaluate user understanding of and satisfaction with global policies.

Paper Structure

This paper contains 18 sections, 2 theorems, 9 equations, 2 figures, 3 tables, 2 algorithms.

Key Result

Lemma 4.1

The initial graph $G_0$ is a homomorphism of the explanation graph $G_I$: $G_0 \rightarrow G_I$.
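
For readers unfamiliar with the term, the sketch below illustrates the general notion of a graph homomorphism: a vertex map under which every edge of the first graph lands on an edge of the second. It is not the paper's construction. The function name `is_homomorphism`, the edge-list representation, and the choice of undirected graphs are all assumptions made for illustration, and the arrow $G_0 \rightarrow G_I$ is read in the usual way, as a map from the first graph to the second.

```python
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def is_homomorphism(
    edges_g: Iterable[Edge],
    edges_h: Iterable[Edge],
    mapping: Dict[Hashable, Hashable],
) -> bool:
    """Check that `mapping` sends every edge of G onto an edge of H.

    Both graphs are treated as undirected (an assumption of this sketch).
    """
    # Store H's edges as unordered pairs so (x, y) and (y, x) match.
    h_edges = {frozenset(edge) for edge in edges_h}
    for u, v in edges_g:
        # A vertex without an image cannot take part in a homomorphism.
        if u not in mapping or v not in mapping:
            return False
        if frozenset((mapping[u], mapping[v])) not in h_edges:
            return False
    return True
```

For example, a three-vertex path `a-b-c` maps homomorphically onto a single edge `x-y` by alternating endpoints (`a→x`, `b→y`, `c→x`), whereas sending both `a` and `b` to `x` fails because `x-x` is not an edge of the target.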

Figures (2)

  • Figure 1: An example use of the CLoVE algorithm to generate a local explanation of why a prompt is classified as harmful by the LLM-as-a-Judge $\mathcal{M}$. A generator $\mathcal{G}$ produces initial supporting and conflicting concepts for the decision. A local explainer $\mathcal{L}$ (e.g. LIME) produces a set of words that affected the decision, and a verifier model $\mathcal{V}$ filters out concepts that are not supported by these words. A local explanation is formed in a BECAUSE-DESPITE format using the verified supporting and conflicting concepts.
  • Figure 2: The GloVE algorithm explaining an LLM-as-a-Judge on a binary harm detection task. Graph $G_0$ is generated from the collection of local explanations $\mathcal{E}$ and summarized through $I$ iterations. In each iteration $i$, concepts in $G_i$ are clustered and a set of candidate labels is generated for each cluster. The best label is the one that entails the largest number of concepts in the cluster, according to the FactReasoner algorithm.
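
The two captions describe a filtering step (CLoVE) and a label-selection step (GloVE) that can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, `toy_verifier` is a keyword-overlap stand-in for the verifier model $\mathcal{V}$, and `toy_entails` is a substring stand-in for the FactReasoner entailment check. The exact wording of the BECAUSE-DESPITE template is also assumed.

```python
from typing import Callable, Iterable, Sequence, Set


def build_local_explanation(
    decision: str,
    supporting: Iterable[str],
    conflicting: Iterable[str],
    influential_words: Set[str],
    is_supported: Callable[[str, Set[str]], bool],
) -> str:
    """Keep only verified concepts and render a BECAUSE-DESPITE explanation."""
    because = [c for c in supporting if is_supported(c, influential_words)]
    despite = [c for c in conflicting if is_supported(c, influential_words)]
    text = f"{decision} BECAUSE " + (" AND ".join(because) or "(no verified concept)")
    if despite:
        text += " DESPITE " + " AND ".join(despite)
    return text


def choose_best_label(
    candidates: Sequence[str],
    cluster: Sequence[str],
    entails: Callable[[str, str], bool],
) -> str:
    """Return the candidate label that entails the most concepts in the cluster."""
    # Ties are broken in favor of the earliest candidate (behavior of max()).
    return max(candidates, key=lambda label: sum(entails(label, c) for c in cluster))


def toy_verifier(concept: str, words: Set[str]) -> bool:
    """Hypothetical stand-in for the verifier: keyword overlap with explainer words."""
    return any(word in concept.lower().split() for word in words)


def toy_entails(label: str, concept: str) -> bool:
    """Hypothetical stand-in for FactReasoner: the label appears in the concept."""
    return label.lower() in concept.lower()
```

With the toy helpers, supporting concepts `["mentions a weapon", "polite tone"]`, conflicting concept `["asks about history"]`, and influential words `{"weapon", "history"}`, the unsupported concept "polite tone" is dropped from the explanation. In a real pipeline, both stand-ins would be replaced by model calls, so the callables are passed in as parameters rather than hard-coded.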

Theorems & Definitions (2)

  • Lemma 4.1
  • Lemma 4.2