Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees

Arya Fayyazi; Haleh Akrami

TL;DR

Proof-of-Perception is a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. It improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.

Abstract

We present Proof-of-Perception (PoP), a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. Each perception or logic node outputs a conformal set, yielding calibrated, stepwise uncertainty; a lightweight controller uses these certificates to allocate compute under a budget, expanding with extra tool calls only when needed and stopping early otherwise. This grounds answers in verifiable evidence, reduces error compounding and hallucinations, and enables principled accuracy-compute trade-offs. Across document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.
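
To make the abstract's two mechanisms concrete, the following is a minimal sketch of a split conformal set for one node and a budgeted controller that stops early once the set is small. It is not the authors' implementation. The nonconformity score (one minus the candidate probability), the stopping rule (set size at most `max_set_size`), and every function name here are assumptions chosen for illustration.

```python
import math


def conformal_quantile(scores, delta):
    """Split-CP threshold: the ceil((n + 1) * (1 - delta))-th smallest
    calibration score, or infinity when the calibration set is too small."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - delta))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def prediction_set(candidate_probs, threshold):
    """Keep every candidate whose nonconformity (assumed: 1 - probability)
    does not exceed the calibrated threshold."""
    return {label for label, p in candidate_probs.items() if 1.0 - p <= threshold}


def run_node(tool_outputs, threshold, budget, max_set_size=1):
    """Hypothetical controller: consume one tool output per call and stop as
    soon as the conformal set is non-empty and small enough, or when the
    call budget is exhausted. Returns the final set and the calls used."""
    current, calls = set(), 0
    for probs in tool_outputs:
        if calls >= budget:
            break
        calls += 1
        current = prediction_set(probs, threshold)
        if 0 < len(current) <= max_set_size:
            break  # certificate is tight enough; skip further tool calls
    return current, calls


# Calibration nonconformity scores for one node type (made-up numbers).
calibration_scores = [0.1, 0.4, 0.2, 0.3]
threshold = conformal_quantile(calibration_scores, delta=0.5)

# Successive tool calls return candidate probabilities; the first is ambiguous.
outputs = [
    {"A": 0.75, "B": 0.72},
    {"A": 0.90, "B": 0.10},
    {"A": 0.95, "B": 0.05},
]
final_set, calls_used = run_node(outputs, threshold, budget=3)
```

In this toy run the first tool call leaves two candidates in the set, so the controller spends one more call and then stops without using the third. With a budget of one call it would instead return the larger, still valid set, which is the accuracy-compute trade-off the abstract describes.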

Paper Structure (43 sections, 22 equations, 3 figures, 3 tables)

Introduction
Background and Related Work
Multimodal Reasoning and CP
Limitations and Positioning
Methodology
Problem Setup and Notation
Reasoning Graph Representation
Graph generation.
Node-Level Predictions and Nonconformity Scores
Examples of nonconformity.
Split CP for Node Certificates
Calibration data.
Set-valued prediction.
Practical instantiation.
Certificate head.
...and 28 more sections

Figures (3)

Figure 1: Overview of the Proof-of-Perception (PoP) workflow. Each perception or reasoning operation is represented as a node in a directed acyclic graph (DAG). Each node is equipped with a conformal prediction head that provides calibrated uncertainty sets $\Gamma^{(t)}_\delta(x)$, and a controller adaptively allocates computation based on these certificates, producing reliable and explainable outputs.
Figure 2: Node-wise conformal coverage vs. average set size across pooled datasets (target $90\%$). Bars show mean coverage; thin caps indicate $\pm1.0\%$ variation. We constrain set sizes (OCR: up to 5, boxes: up to 3), which keeps coverage near target without inflating candidate sets.
Figure 3: Accuracy–compute frontiers. PoP attains higher accuracy for a given budget and avoids over-expansion once node certificates meet the target. Shaded bands indicate $\pm 0.4$ absolute variation across three seeds.
