Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations
Jasmina Gajcin, Erik Miehling, Rahul Nair, Elizabeth Daly, Radu Marinescu, Seshu Tirupathi
TL;DR
The paper tackles opacity in LLM-as-a-Judge decisions by introducing CLoVE for generating verifiable local, contrastive concepts and GloVE for distilling these into a global, faithful policy. GloVE represents local explanations as a $K$-partite graph, then iteratively clusters, labels, and verifies concepts with a FactReasoner to produce concise, rule-based global explanations. Across seven harm-detection datasets, GloVE demonstrates high fidelity to the original LLM decisions and robust performance under paraphrasing and basic adversarial attacks, with a user study indicating modest gains in perceived usefulness. This work advances transparent, interpretable governance of LLM-based judgments and has practical implications for safer, more accountable deployments.
Abstract
LLM-as-a-Judge, the practice of using LLMs to evaluate text, is increasingly deployed at scale to augment or even replace human annotation. It is therefore imperative to understand the potential biases and risks of doing so. In this work, we propose an approach for extracting high-level, concept-based global policies from an LLM-as-a-Judge. Our approach consists of two algorithms: 1) CLoVE (Contrastive Local Verifiable Explanations), which generates verifiable, concept-based, contrastive local explanations, and 2) GloVE (Global Verifiable Explanations), which uses iterative clustering, summarization, and verification to condense local rules into a global policy. We evaluate GloVE on seven standard benchmarking datasets for content harm detection and find that the extracted global policies are highly faithful to the decisions of the LLM-as-a-Judge. We also evaluate the robustness of the global policies to text perturbations and adversarial attacks. Finally, we conduct a user study to evaluate user understanding of and satisfaction with the global policies.
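To make the notion of a policy being "faithful to decisions of the LLM-as-a-Judge" concrete, the sketch below computes fidelity as a simple agreement rate between a rule-based policy and the judge's labels. This is an illustrative assumption, not the paper's evaluation code: the exact fidelity metric used by the authors is not stated here, and the names `make_policy` and `fidelity`, the "flag if any harmful concept is present" rule, and the concept names are all hypothetical.

```python
from __future__ import annotations

from typing import Callable, Sequence


def make_policy(harmful_concepts: set[str]) -> Callable[[set[str]], bool]:
    """Build a toy global policy (hypothetical rule form).

    The policy flags a text as harmful if it contains at least one
    concept from `harmful_concepts`.
    """
    def policy(concepts_in_text: set[str]) -> bool:
        return bool(harmful_concepts & concepts_in_text)

    return policy


def fidelity(
    policy: Callable[[set[str]], bool],
    concept_sets: Sequence[set[str]],
    judge_labels: Sequence[bool],
) -> float:
    """Fraction of examples where the policy agrees with the judge.

    Assumption: fidelity is measured as plain agreement rate.
    """
    if len(concept_sets) != len(judge_labels):
        raise ValueError("concept_sets and judge_labels must have equal length")
    if not judge_labels:
        raise ValueError("at least one example is required")
    matches = sum(
        policy(concepts) == label
        for concepts, label in zip(concept_sets, judge_labels)
    )
    return matches / len(judge_labels)


# Illustrative data: concepts detected in four texts, and the judge's
# harmful / not-harmful decision for each (made up for this sketch).
concept_sets = [{"threat"}, {"greeting"}, {"insult"}, {"sarcasm"}]
judge_labels = [True, False, True, True]

policy = make_policy({"threat", "insult"})
score = fidelity(policy, concept_sets, judge_labels)
print(f"fidelity = {score:.2f}")  # policy disagrees only on the "sarcasm" text
```

In this toy setup the policy matches the judge on three of the four texts, so a low score would signal that the extracted rules miss concepts the judge actually relies on.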
