Patient-Zero: Scaling Synthetic Patient Agents to Real-World Distributions without Real Patient Data

Yunghwei Lai; Ziyue Wang; Weizhi Ma; Yang Liu

TL;DR

Patient-Zero addresses the privacy risks and distribution biases of medical data by generating synthetic patient records ab initio from abstract clinical knowledge. It couples a four-stage Medically-Aligned Hierarchical Synthesis pipeline with a Dual-Track Cognitive Memory system and an NLI-Verifier to produce both static records and interactive agents that preserve clinical consistency. In expert assessments, the synthetic records are nearly indistinguishable from real data and rated higher in clinical quality, and a model trained on the synthetic corpus shows substantial downstream gains on MedQA (+24.0%) and MMLU (+14.5%). A distribution analysis indicates alignment with real-world epidemiology, supporting privacy-preserving, scalable data generation for clinical AI research and training.

Abstract

Synthetic data generation with Large Language Models (LLMs) has emerged as a promising solution in the medical domain to mitigate data scarcity and privacy constraints. However, existing approaches remain derivative: they rely on real-world records, which introduces privacy risks and distribution biases. Furthermore, current patient agents face the Stability-Plasticity Dilemma, struggling to maintain clinical consistency during dynamic inquiries. To address these challenges, we introduce Patient-Zero, a novel framework for ab initio patient simulation that requires no real medical records. Our Medically-Aligned Hierarchical Synthesis framework generates comprehensive and diverse patient records from abstract clinical guidelines via stratified attribute permutation. To support rigorous clinical interaction, we design a Dual-Track Cognitive Memory System that enables agents to dynamically update memory while preserving logical consistency and persona adherence. Extensive evaluations show that Patient-Zero establishes a new state of the art in both data quality and interaction fidelity. In human expert evaluations, senior licensed physicians judge our synthetic data to be statistically indistinguishable from real human-authored data and higher in clinical quality. Furthermore, a downstream medical reasoning model trained on our synthetic dataset shows substantial performance gains (MedQA +24.0%; MMLU +14.5%), demonstrating the practical utility of our framework.
