FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight
Jiayi Zhou, Yang Sheng, Hantao Lou, Yaodong Yang, Jie Fu
TL;DR
FormalJudge reframes agent oversight as formal verification rather than probabilistic judgment, addressing the unreliability of supervising probabilistic systems with probabilistic monitors. It introduces a bidirectional, neuro-symbolic pipeline that uses LLMs as specification compilers to decompose high-level intent into atomic, verifiable constraints, which are then proven correct by Dafny specifications and SMT solving with Z3, yielding mathematical guarantees. Across three behavioral safety and deception-focused benchmarks, FormalJudge achieves substantial improvements over LLM-based judges and demonstrates strong weak-to-strong generalization and iterative safety refinement via formal feedback. The approach reduces reliance on neural reasoning for the global verdict by confining it to atomic fact extraction and leverages deterministic verification for soundness, enhancing robustness against persuasive manipulation by sophisticated agents. The work substantiates a practical pathway to scalable, provable oversight in high-stakes AI systems and provides open-source resources for reproducibility and extension.
Abstract
As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural language requirements to formal specifications. This paper bridges this gap by proposing , a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that top-down decompose high-level human intent into atomic, verifiable constraints, then bottom-up prove compliance using Dafny specifications and Z3 Satisfiability modulo theories solving, which produces mathematical guarantees rather than probabilistic scores. We validate across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization where a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and provides near-linear safety improvement through iterative refinement.
