Table of Contents
Fetching ...

FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight

Jiayi Zhou, Yang Sheng, Hantao Lou, Yaodong Yang, Jie Fu

TL;DR

FormalJudge reframes agent oversight as formal verification rather than probabilistic judgment, addressing the unreliability of supervising probabilistic systems with probabilistic monitors. It introduces a bidirectional, neuro-symbolic pipeline that uses LLMs as specification compilers to decompose high-level intent into atomic, verifiable constraints, which are then proven correct by Dafny specifications and SMT solving with Z3, yielding mathematical guarantees. Across three behavioral safety and deception-focused benchmarks, FormalJudge achieves substantial improvements over LLM-based judges and demonstrates strong weak-to-strong generalization and iterative safety refinement via formal feedback. The approach reduces reliance on neural reasoning for the global verdict by confining it to atomic fact extraction and leverages deterministic verification for soundness, enhancing robustness against persuasive manipulation by sophisticated agents. The work substantiates a practical pathway to scalable, provable oversight in high-stakes AI systems and provides open-source resources for reproducibility and extension.

Abstract

As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural language requirements to formal specifications. This paper bridges this gap by proposing , a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that top-down decompose high-level human intent into atomic, verifiable constraints, then bottom-up prove compliance using Dafny specifications and Z3 Satisfiability modulo theories solving, which produces mathematical guarantees rather than probabilistic scores. We validate across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization where a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and provides near-linear safety improvement through iterative refinement.

FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight

TL;DR

FormalJudge reframes agent oversight as formal verification rather than probabilistic judgment, addressing the unreliability of supervising probabilistic systems with probabilistic monitors. It introduces a bidirectional, neuro-symbolic pipeline that uses LLMs as specification compilers to decompose high-level intent into atomic, verifiable constraints, which are then proven correct by Dafny specifications and SMT solving with Z3, yielding mathematical guarantees. Across three behavioral safety and deception-focused benchmarks, FormalJudge achieves substantial improvements over LLM-based judges and demonstrates strong weak-to-strong generalization and iterative safety refinement via formal feedback. The approach reduces reliance on neural reasoning for the global verdict by confining it to atomic fact extraction and leverages deterministic verification for soundness, enhancing robustness against persuasive manipulation by sophisticated agents. The work substantiates a practical pathway to scalable, provable oversight in high-stakes AI systems and provides open-source resources for reproducibility and extension.

Abstract

As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural language requirements to formal specifications. This paper bridges this gap by proposing , a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that top-down decompose high-level human intent into atomic, verifiable constraints, then bottom-up prove compliance using Dafny specifications and Z3 Satisfiability modulo theories solving, which produces mathematical guarantees rather than probabilistic scores. We validate across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization where a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and provides near-linear safety improvement through iterative refinement.
Paper Structure (48 sections, 16 equations, 7 figures, 12 tables)

This paper contains 48 sections, 16 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Teaser: LLM agents face scalable oversight challenges: it is difficult to identify a reliable oversight agent. While the LLM-as-a-Judge baseline relies on probabilistic Chain-of-Thought reasoning, we introduces a Formal-of-Thought architecture that leverages LLMs as specification compilers. It decomposes agent trajectories into atomic facts verified by SMT solvers to provide mathematical proofs of correctness rather than subjective scores. By separating neural semantic extraction from deterministic logical verification, the system ensures oversight remains immune to persuasive manipulation.
  • Figure 2: The neuro-symbolic architecture and verification pipeline of FormalJudge.Panel (a) outlines the oversight workflow where an LLM compiles user intent into Dafny specifications and extracts atomic facts, enabling a Z3 SMT solver to provide deterministic proofs of correctness independent of neural reasoning. Panel (b) details the bidirectional Formal-of-Thought process, illustrating how high-level instructions are decomposed into grounded atomic queries and subsequently translated into logical predicates to mathematically verify agent trajectories against specific constraints.
  • Figure 3: Toy Example: Instantiation of FormalJudge on a trip booking task. The agent satisfies the budget but violates the conditional date rule: "if flying, hotel must start on arrival day." This nested constraint ($f_1 \Rightarrow f_5 \geq f_4$) requires formal logic to verify, whereas an LLM judge might overlook the one-day mismatch.
  • Figure 4: Detection rate across attack sophistication levels (L1--L4). Top row: closed-source judges (Gemini-2.5-Pro, GPT-5.2, Claude-4.5-Sonnet, Grok-3); bottom row: open-source Qwen-2.5 family (7B--72B). FormalJudge maintains stable high detection across all levels, while LLM baselines exhibit high variance: particularly on open-source judges, where detection can drop below 25%.
  • Figure 5: Weak-to-strong deception detection. (a) Detection accuracy vs. agent model size: LLM-based judges degrade from 78--85% on 7B agents to 62--71% on 72B agents, while FormalJudge maintains 91--96% accuracy. Shaded regions indicate variance across judge sizes. (b) Accuracy improvement of FormalJudge over the best baseline for each judge-agent pair. (c) Grouped comparison showing FormalJudge is the only method enabling reliable weak-to-strong oversight.
  • ...and 2 more figures