Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees

Arya Fayyazi, Haleh Akrami

TL;DR

Proof-of-Perception (PoP) is a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. It improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.

Abstract

We present Proof-of-Perception (PoP), a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. Each perception or logic node outputs a conformal set, yielding calibrated, stepwise uncertainty; a lightweight controller uses these certificates to allocate compute under a budget, expanding with extra tool calls only when needed and stopping early otherwise. This grounds answers in verifiable evidence, reduces error compounding and hallucinations, and enables principled accuracy-compute trade-offs. Across document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.
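
To make "each node outputs a conformal set" concrete, here is a minimal sketch of standard split conformal prediction for a single node. It is an illustration under stated assumptions, not the paper's implementation: the nonconformity score (one minus the model probability), the function names `conformal_quantile` and `prediction_set`, the calibration scores, and the OCR candidates are all hypothetical.

```python
import math

def conformal_quantile(scores, delta):
    """Finite-sample-corrected (1 - delta) quantile of calibration scores."""
    n = len(scores)
    # Standard split-conformal rank: ceil((n + 1) * (1 - delta)).
    k = math.ceil((n + 1) * (1 - delta))
    if k > n:
        # Too few calibration points for this delta: every candidate must be kept.
        return float("inf")
    return sorted(scores)[k - 1]

def prediction_set(probs, qhat):
    """Keep every candidate whose score (1 - probability) is at most qhat."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}

# Hypothetical calibration scores: 1 - (probability given to the true label).
calibration_scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.28,
                      0.30, 0.33, 0.35, 0.38, 0.40, 0.45, 0.55, 0.60, 0.70, 0.90]
qhat = conformal_quantile(calibration_scores, delta=0.1)

# Hypothetical OCR candidates for one node, with model probabilities.
candidates = {"2019": 0.62, "2018": 0.35, "2010": 0.02, "2079": 0.01}
gamma = prediction_set(candidates, qhat)
print(qhat, sorted(gamma))
```

With exchangeable calibration and test data, a set built this way contains the true value with probability at least $1-\delta$. A small set signals that the node is confident; a large one is the kind of signal a controller could use to request another tool call.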

Paper Structure

This paper contains 43 sections, 22 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Overview of the Proof-of-Perception (PoP) workflow. Each perception or reasoning operation is represented as a node in a directed acyclic graph (DAG). Each node is equipped with a conformal prediction head that provides calibrated uncertainty sets $\Gamma^{(t)}_\delta(x)$, and a controller adaptively allocates computation based on these certificates, producing reliable and explainable outputs.
  • Figure 2: Node-wise conformal coverage vs. average set size across pooled datasets (target $90\%$). Bars show mean coverage; thin caps indicate $\pm1.0\%$ variation. We constrain set sizes (OCR: up to 5, boxes: up to 3), which keeps coverage near target without inflating candidate sets.
  • Figure 3: Accuracy–compute frontiers. PoP attains higher accuracy for a given budget and avoids over-expansion once node certificates meet the target. Shaded bands indicate $\pm 0.4$ absolute variation across three seeds.
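
The Figure 1 caption describes a controller that allocates computation based on node certificates. The sketch below shows one way such a loop could look. It is an assumed greedy heuristic, not the authors' controller: the `run_controller` function, the per-node `cost`, and the simulated `set_sizes` (the set size after 0, 1, 2, ... extra tool calls) are hypothetical stand-ins.

```python
def run_controller(nodes, budget, max_set_size=1):
    """Greedy sketch: refine the least certain node first, stop when done or broke."""
    spent = 0
    calls = {name: 0 for name in nodes}
    sizes = {name: node["set_sizes"][0] for name, node in nodes.items()}
    while True:
        # Nodes whose set is still too large, can be refined, and are affordable.
        open_nodes = [
            name for name, node in nodes.items()
            if sizes[name] > max_set_size
            and calls[name] + 1 < len(node["set_sizes"])
            and spent + node["cost"] <= budget
        ]
        if not open_nodes:
            break  # every certificate meets the target, or the budget is exhausted
        target = max(open_nodes, key=lambda name: sizes[name])
        calls[target] += 1
        spent += nodes[target]["cost"]
        sizes[target] = nodes[target]["set_sizes"][calls[target]]
    return sizes, spent

# Hypothetical graph: set size after each extra tool call, and cost per call.
nodes = {
    "ocr":    {"cost": 2, "set_sizes": [4, 2, 1]},
    "detect": {"cost": 3, "set_sizes": [1]},
    "lookup": {"cost": 1, "set_sizes": [3, 1]},
}
full_sizes, full_spent = run_controller(nodes, budget=5)
tight_sizes, tight_spent = run_controller(nodes, budget=3)
print(full_sizes, full_spent)
print(tight_sizes, tight_spent)
```

In this toy run, the `detect` node already has a singleton set, so no compute is spent on it. With the smaller budget, the loop stops while the `ocr` set still holds two candidates, which illustrates the accuracy-compute trade-off in miniature.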