Haibu Mathematical-Medical Intelligent Agent: Enhancing Large Language Model Reliability in Medical Tasks via Verifiable Reasoning Chains
Yilun Zhang, Dexing Kong
TL;DR
This work tackles the unreliability of medical LLMs by introducing the Haibu MMIA, a verifiable reasoning framework that enforces a plan-execute-verify loop and produces auditable reasoning chains that are treated as formal theorems. The architecture couples an axiom/theorem knowledge base with Retrieval-Augmented Generation (RAG) and an independent auditor that checks each chain for logical coherence, evidence traceability, and reasoning soundness; validated proofs are stored to accelerate future tasks. Across four healthcare administration scenarios (DRG/DIP auditing, medical device regulatory compliance, real-time EHR quality control, and complex insurance adjudication), MMIA achieves error-detection rates above 98% with false-positive rates below 1%, outperforming baseline LLMs. In addition, once the knowledge base matures, RAG matching is projected to reduce processing costs by approximately 85% on average, indicating strong potential for scalable, trustworthy, and cost-effective AI in medicine.
Abstract
Large Language Models (LLMs) show promise in medicine but are prone to factual and logical errors that are unacceptable in this high-stakes field. To address this, we introduce the "Haibu Mathematical-Medical Intelligent Agent" (MMIA), an LLM-driven architecture that ensures reliability through a formally verifiable reasoning process. MMIA recursively breaks down complex medical tasks into atomic, evidence-based steps. The entire reasoning chain is then automatically audited for logical coherence and evidence traceability, in a manner analogous to theorem proving. A key innovation is MMIA's "bootstrapping" mode, which stores validated reasoning chains as "theorems." Subsequent tasks can then be solved efficiently using Retrieval-Augmented Generation (RAG), shifting from costly first-principles reasoning to a low-cost verification model. We validated MMIA across four healthcare administration domains, including DRG/DIP audits and medical insurance adjudication, using expert-validated benchmarks. MMIA achieved an error detection rate exceeding 98% with a false positive rate below 1%, significantly outperforming baseline LLMs. Furthermore, the RAG matching mode is projected to reduce average processing costs by approximately 85% as the knowledge base matures. In conclusion, MMIA's verifiable reasoning framework is a significant step toward trustworthy, transparent, and cost-effective AI systems, making LLM technology viable for critical applications in medicine.
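The "bootstrapping" mode described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. All names (`Step`, `TheoremStore`, `audit`, `solve`) are hypothetical, an exact-match dictionary lookup stands in for RAG retrieval, and the toy auditor only checks that every step cites evidence, which is far weaker than the logical-coherence audit the paper describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Step:
    """One atomic reasoning step (hypothetical structure)."""
    claim: str
    evidence: str  # identifier of the supporting source; empty if none


@dataclass
class TheoremStore:
    """Hypothetical store of validated chains ("theorems").

    Assumption: an exact-match key stands in for RAG similarity search.
    """
    proofs: Dict[str, List[Step]] = field(default_factory=dict)

    def lookup(self, task: str) -> Optional[List[Step]]:
        return self.proofs.get(task)

    def save(self, task: str, chain: List[Step]) -> None:
        self.proofs[task] = chain


def audit(chain: List[Step]) -> bool:
    """Toy auditor: the chain is non-empty and every step cites evidence."""
    return bool(chain) and all(s.claim and s.evidence for s in chain)


def solve(
    task: str,
    store: TheoremStore,
    reason_from_scratch: Callable[[str], List[Step]],
) -> Tuple[List[Step], str]:
    """Reuse a stored theorem if one passes the audit, else derive a new one."""
    cached = store.lookup(task)
    if cached is not None and audit(cached):
        return cached, "retrieved"  # cheap path: verify only

    chain = reason_from_scratch(task)  # costly path: first-principles reasoning
    if not audit(chain):
        raise ValueError("reasoning chain failed audit")
    store.save(task, chain)  # only validated chains become theorems
    return chain, "derived"
```

In this sketch the first call for a task pays for full reasoning and later calls for the same task only re-run the audit, which mirrors the shift from first-principles reasoning to verification that the abstract attributes to a mature knowledge base.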
