Verifiable Semantics for Agent-to-Agent Communication

Philipp Schoenegger, Matt Carlson, Chris Schneider, Chris Daly

TL;DR

A certification protocol based on the stimulus-meaning model, where agents are tested on shared observable events and terms are certified if empirical disagreement falls below a statistical threshold, provides a first step towards verifiable agent-to-agent communication.

Abstract

Multiagent AI systems require consistent communication, but we lack methods to verify that agents share the same understanding of the terms used. Natural language is interpretable but vulnerable to semantic drift, while learned protocols are efficient but opaque. We propose a certification protocol based on the stimulus-meaning model, where agents are tested on shared observable events and terms are certified if empirical disagreement falls below a statistical threshold. In this protocol, agents restricting their reasoning to certified terms ("core-guarded reasoning") achieve provably bounded disagreement. We also outline mechanisms for detecting drift (recertification) and recovering shared vocabulary (renegotiation). In simulations with varying degrees of semantic divergence, core-guarding reduces disagreement by 72-96%. In a validation with fine-tuned language models, disagreement is reduced by 51%. Our framework provides a first step towards verifiable agent-to-agent communication.

Paper Structure

This paper contains 15 sections, 2 theorems, 1 equation, 4 figures, 2 tables, and 1 algorithm.

Key Result

Theorem 1

If term $T$ is certified with parameters $(\tau, \delta)$, then the true contradictory-divergence rate $p_T \leq \tau$ with probability $\geq 1-\delta$.
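One way to obtain a guarantee of this form is to certify a term only when an upper confidence bound on its contradiction rate falls below $\tau$. The sketch below uses a one-sided Hoeffding bound. This is an assumption made for illustration, since this summary does not state which bound the paper uses, and the function names `hoeffding_upper_bound` and `certify` are hypothetical.

```python
import math


def hoeffding_upper_bound(contradictions: int, n_tests: int, delta: float) -> float:
    """Upper confidence bound on the true contradiction rate p_T.

    Assumes the n_tests witnessed tests are independent draws, so the
    one-sided Hoeffding inequality gives
        P(p_T > p_hat + eps) <= exp(-2 * n_tests * eps**2).
    Setting the right-hand side to delta yields eps below.
    """
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    p_hat = contradictions / n_tests
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n_tests))
    return p_hat + eps


def certify(contradictions: int, n_tests: int, tau: float, delta: float) -> bool:
    """Certify the term only if the upper bound is at most tau.

    Under the independence assumption, a certified term then satisfies
    p_T <= tau with probability at least 1 - delta.
    """
    return hoeffding_upper_bound(contradictions, n_tests, delta) <= tau


if __name__ == "__main__":
    # Zero observed contradictions is not sufficient on its own:
    # the sample must also be large enough for the bound to drop below tau.
    print(certify(contradictions=0, n_tests=100, tau=0.1, delta=0.05))
    print(certify(contradictions=0, n_tests=500, tau=0.1, delta=0.05))
```

Under this bound the margin shrinks as $1/\sqrt{n}$, so tighter thresholds $\tau$ require more witnessed tests before any term can certify. A tighter exact binomial bound would certify with fewer tests, at the cost of a less simple formula.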

Figures (4)

  • Figure 1: Overview of the stimulus-meaning protocol. (a) Agents are tested on shared events and verdicts are recorded in a public ledger. (b) Terms with contradiction rates below threshold $\tau$ are certified into a core vocabulary $V^*$. (c) Downstream reasoning is restricted to $V^*$, reducing disagreement. Periodic recertification detects drift and renegotiation recovers excluded terms.
  • Figure 2: Disagreement rates and core size across divergence conditions. (A) Unguarded (solid) vs. core-guarded (hatched) disagreement rates. Guarded rates remain at $\sim$2% across all scenarios, while unguarded rates rise with divergence. (B) Core size distribution. As divergence increases, fewer terms certify: Noise-Only averages 3.8 terms, Moderate Drift 2.6, High Divergence 0.2 (95% empty cores). Dashed lines indicate means.
  • Figure 3: Drift, recertification, and renegotiation over 50 epochs. Drift is injected at epoch 10. (A) Baseline: stable disagreement. (B) Drift with frozen core: disagreement rises to 4.6%. (C) Recertification: the contested term is removed and disagreement returns to $\sim$2%, but core size remains low. (D) Renegotiation: unguarded disagreement drops and vocabulary recovers to 4.0 terms. (E) Certified core size across conditions.
  • Figure 4: Coverage-reliability trade-off as a function of the certification threshold $\tau$. Each curve represents a different alignment regime (fraction $\pi$ of well-aligned terms). Higher $\pi$ shifts the Pareto frontier outward, enabling higher coverage at any given reliability level. (A) Pareto frontier showing coverage vs. guarded disagreement. Dotted lines show unguarded disagreement baselines. (B) Coverage (solid) and disagreement (dashed) as functions of $\tau$ directly.
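The pipeline described in the Figure 1 caption (ledger of verdicts, certified core $V^*$, guarded reasoning) can be sketched roughly as follows. The ledger row layout, the verdict labels `"assent"`, `"dissent"`, and `"abstain"`, and all function names are assumptions made for illustration. For brevity the sketch thresholds the raw empirical contradiction rate and omits any confidence margin.

```python
from collections import defaultdict


def contradiction_rates(ledger, agent_a, agent_b):
    """Per-term fraction of shared events on which the two agents contradict.

    Each ledger row is assumed to be (event_id, term, agent, verdict).
    A contradiction is counted when one agent assents and the other dissents;
    an abstention by either agent is not counted as a contradiction.
    """
    verdicts = defaultdict(dict)  # (term, event_id) -> {agent: verdict}
    for event_id, term, agent, verdict in ledger:
        verdicts[(term, event_id)][agent] = verdict

    tested = defaultdict(int)
    contradicted = defaultdict(int)
    for (term, _event_id), by_agent in verdicts.items():
        if agent_a not in by_agent or agent_b not in by_agent:
            continue  # only events witnessed by both agents count as tests
        tested[term] += 1
        if {by_agent[agent_a], by_agent[agent_b]} == {"assent", "dissent"}:
            contradicted[term] += 1
    return {term: contradicted[term] / tested[term] for term in tested}


def certified_core(rates, tau):
    """Terms whose empirical contradiction rate is below tau (the core V*)."""
    return {term for term, rate in rates.items() if rate < tau}


def is_core_guarded(message_terms, core):
    """A message passes the guard only if every term it uses is certified."""
    return set(message_terms) <= core


if __name__ == "__main__":
    ledger = [
        (1, "red", "A", "assent"), (1, "red", "B", "assent"),
        (2, "red", "A", "dissent"), (2, "red", "B", "dissent"),
        (1, "large", "A", "assent"), (1, "large", "B", "dissent"),
        (2, "large", "A", "assent"), (2, "large", "B", "assent"),
    ]
    rates = contradiction_rates(ledger, "A", "B")
    core = certified_core(rates, tau=0.1)
    print(rates, core)
    print(is_core_guarded(["red"], core), is_core_guarded(["red", "large"], core))
```

In this toy ledger the agents never contradict each other on "red" but do so on half of the "large" tests, so only "red" enters the core and any message using "large" is rejected by the guard.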

Theorems & Definitions (7)

  • Definition 1: Event Space
  • Definition 2: Witnessed Test
  • Definition 3: Stimulus Meaning
  • Definition 4: Divergence
  • Theorem 1: Certification Soundness
  • Definition 5: Core-Guarded Reasoning
  • Proposition 1: Reproducibility