Agentified Assessment of Logical Reasoning Agents

Zhiyu Ni; Yifeng Xiao; Zheng Liang

TL;DR

This work uses an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface.

Abstract

We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO. The agent translates natural language premises and conclusions into executable Z3Py programs and employs satisfiability modulo theories (SMT) solving to determine logical entailment. On the cleaned FOLIO validation set, the auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (73.89%).
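The assessor's four duties named above (issue tasks, enforce execution budgets, parse outputs, record structured failure types) can be illustrated with a minimal sketch. It is not the paper's implementation: the agent under test is modeled as a plain Python callable rather than an agent-to-agent endpoint, and the task fields, the failure-type names (`timeout`, `agent_error`, `parse_error`), and the helpers `run_with_budget`, `parse_label`, and `assess` are hypothetical names chosen for illustration. The label set True/False/Uncertain follows FOLIO.

```python
import concurrent.futures as cf
from dataclasses import dataclass
from typing import Callable, Optional

# FOLIO's three answer labels.
LABELS = ("True", "False", "Uncertain")


@dataclass
class Record:
    """One structured assessment record (field names are illustrative)."""
    task_id: str
    predicted: Optional[str]
    failure: Optional[str]  # None, "timeout", "agent_error", or "parse_error"
    correct: bool


def run_with_budget(agent: Callable[[str], str], prompt: str, budget_s: float):
    """Call the agent in a worker thread and stop waiting after budget_s seconds.

    Returns (raw_output, failure). The worker thread is not killed on timeout;
    the assessor simply stops waiting for it and moves on.
    """
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(agent, prompt)
    try:
        return future.result(timeout=budget_s), None
    except cf.TimeoutError:
        return None, "timeout"
    except Exception:
        return None, "agent_error"
    finally:
        pool.shutdown(wait=False)


def parse_label(raw) -> Optional[str]:
    """Map the agent's raw output to a label, or None if it is not a valid label."""
    text = str(raw).strip().lower()
    for label in LABELS:
        if text == label.lower():
            return label
    return None


def assess(agent: Callable[[str], str], tasks, budget_s: float):
    """Issue each task, enforce the budget, parse the output, record the outcome."""
    records = []
    for task in tasks:
        raw, failure = run_with_budget(agent, task["prompt"], budget_s)
        predicted = None
        if failure is None:
            predicted = parse_label(raw)
            if predicted is None:
                failure = "parse_error"
        correct = failure is None and predicted == task["gold"]
        records.append(Record(task["id"], predicted, failure, correct))
    return records


def accuracy(records) -> float:
    """Fraction of tasks answered correctly; failed tasks count as incorrect."""
    return sum(r.correct for r in records) / len(records) if records else 0.0
```

Counting every failure as an incorrect answer, while still recording its type, is one possible design choice: it keeps accuracy comparable across agents and leaves the failure breakdown available for auditing.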

Paper Structure (13 sections, 1 equation, 2 figures, 1 table)

Introduction
Benchmark and Data Cleaning
Verification and Repair Pipeline
Benchmark Statistics
Agentified Evaluation Framework
Agentified Assessment
Assessor Agent (Evaluation Protocol)
Reasoning Agents Under Test
Implementation note.
Experiments
Experimental Setup and Baseline
Results and Analysis
Conclusion

Figures (2)

Figure 1: Overview of the data cleaning pipeline.
Figure 2: Traditional evaluation harnesses couple task execution, environments, and judging logic to a preset harness. In agentified assessment, an assessor agent evaluates an agent under test via an A2A interface [a2a2026spec], reducing integration overhead.
