Table of Contents
Fetching ...

SPECA: Specification-to-Checklist Agentic Auditing for Multi-Implementation Systems -- A Case Study on Ethereum Clients

Masato Kamba, Akiyoshi Sannai

TL;DR

SPECA addresses semantic blind spots in multi-implementation security audits by anchoring checks to natural-language specifications through an artifact-centric, two-phase process that produces reusable checklists and maps them to code across implementations. It enables cross-implementation reuse ($1\rightarrow N$) to scale auditing and reduces false positives by explicitly formalizing threat models. In-the-wild evaluation on the Ethereum Fusaka upgrade shows cross-implementation checks dominate valid findings (about 76.5%), while threat-model misalignment drives most false positives; preliminary V2 improvements demonstrate some recall gains for high-impact issues, though stateful, implementation-specific bugs remain challenging. The approach promises to scale expert auditing efforts and reduce semantic blind spots, particularly when threat boundaries and scope are clearly formalized beforehand.

Abstract

Multi-implementation systems are increasingly audited against natural-language specifications. Differential testing scales well when implementations disagree, but it provides little signal when all implementations converge on the same incorrect interpretation of an ambiguous requirement. We present SPECA, a Specification-to-Checklist Auditing framework that turns normative requirements into checklists, maps them to implementation locations, and supports cross-implementation reuse. We instantiate SPECA in an in-the-wild security audit contest for the Ethereum Fusaka upgrade, covering 11 production clients. Across 54 submissions, 17 were judged valid by the contest organizers. Cross-implementation checks account for 76.5 percent (13 of 17) of valid findings, suggesting that checklist-derived one-to-many reuse is a practical scaling mechanism in multi-implementation audits. To understand false positives, we manually coded the 37 invalid submissions and find that threat model misalignment explains 56.8 percent (21 of 37): reports that rely on assumptions about trust boundaries or scope that contradict the audit's rules. We detected no High or Medium findings in the V1 deployment; misses concentrated in specification details and implicit assumptions (57.1 percent), timing and concurrency issues (28.6 percent), and external library dependencies (14.3 percent). Our improved agent, evaluated against the ground truth of a competitive audit, achieved a strict recall of 27.3 percent on high-impact vulnerabilities, placing it in the top 4 percent of human auditors and outperforming 49 of 51 contestants on critical issues. These results, though from a single deployment, suggest that early, explicit threat modeling is essential for reducing false positives and focusing agentic auditing effort. The agent-driven process enables expert validation and submission in about 40 minutes on average.

SPECA: Specification-to-Checklist Agentic Auditing for Multi-Implementation Systems -- A Case Study on Ethereum Clients

TL;DR

SPECA addresses semantic blind spots in multi-implementation security audits by anchoring checks to natural-language specifications through an artifact-centric, two-phase process that produces reusable checklists and maps them to code across implementations. It enables cross-implementation reuse () to scale auditing and reduces false positives by explicitly formalizing threat models. In-the-wild evaluation on the Ethereum Fusaka upgrade shows cross-implementation checks dominate valid findings (about 76.5%), while threat-model misalignment drives most false positives; preliminary V2 improvements demonstrate some recall gains for high-impact issues, though stateful, implementation-specific bugs remain challenging. The approach promises to scale expert auditing efforts and reduce semantic blind spots, particularly when threat boundaries and scope are clearly formalized beforehand.

Abstract

Multi-implementation systems are increasingly audited against natural-language specifications. Differential testing scales well when implementations disagree, but it provides little signal when all implementations converge on the same incorrect interpretation of an ambiguous requirement. We present SPECA, a Specification-to-Checklist Auditing framework that turns normative requirements into checklists, maps them to implementation locations, and supports cross-implementation reuse. We instantiate SPECA in an in-the-wild security audit contest for the Ethereum Fusaka upgrade, covering 11 production clients. Across 54 submissions, 17 were judged valid by the contest organizers. Cross-implementation checks account for 76.5 percent (13 of 17) of valid findings, suggesting that checklist-derived one-to-many reuse is a practical scaling mechanism in multi-implementation audits. To understand false positives, we manually coded the 37 invalid submissions and find that threat model misalignment explains 56.8 percent (21 of 37): reports that rely on assumptions about trust boundaries or scope that contradict the audit's rules. We detected no High or Medium findings in the V1 deployment; misses concentrated in specification details and implicit assumptions (57.1 percent), timing and concurrency issues (28.6 percent), and external library dependencies (14.3 percent). Our improved agent, evaluated against the ground truth of a competitive audit, achieved a strict recall of 27.3 percent on high-impact vulnerabilities, placing it in the top 4 percent of human auditors and outperforming 49 of 51 contestants on critical issues. These results, though from a single deployment, suggest that early, explicit threat modeling is essential for reducing false positives and focusing agentic auditing effort. The agent-driven process enables expert validation and submission in about 40 minutes on average.
Paper Structure (34 sections, 7 figures, 13 tables)

This paper contains 34 sections, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Fig A: Checklist construction. Phase 1 extracts structured knowledge from specifications and produces a reusable checklist.
  • Figure 2: Fig B: Strategy B cross-implementation checks. A checklist item identified on Client A is propagated across other clients, accounting for 76.5% of valid findings in our deployment (§\ref{['sec:evaluation']}).
  • Figure 3: SPECA V2 Agent Overview.
  • Figure 4: Program graph creation prompt example.
  • Figure 5: Example program graph generated from an EIP specification (cell proof computation).
  • ...and 2 more figures