Patient-Zero: Scaling Synthetic Patient Agents to Real-World Distributions without Real Patient Data

Yunghwei Lai, Ziyue Wang, Weizhi Ma, Yang Liu

TL;DR

Patient-Zero addresses privacy and distribution biases in medical data by generating synthetic patient records ab initio from abstract clinical knowledge. It couples a four-stage Medically-Aligned Hierarchical Synthesis pipeline with a Dual-Track Cognitive Memory system and an NLI-Verifier to produce both static records and interactive agents that preserve clinical consistency. In expert assessments, the synthetic records are nearly indistinguishable from real data and are rated higher in clinical quality; a downstream model trained on the synthetic corpus shows substantial gains on MedQA (+24.0%) and MMLU (+14.5%). The distribution analysis confirms real-world epidemiological alignment, supporting privacy-preserving, scalable data generation for clinical AI research and training.

Abstract

Synthetic data generation with Large Language Models (LLMs) has emerged as a promising solution in the medical domain to mitigate data scarcity and privacy constraints. However, existing approaches remain constrained by their derivative nature: they rely on real-world records, which pose privacy risks and introduce distribution biases. Furthermore, current patient agents face the Stability-Plasticity Dilemma, struggling to maintain clinical consistency during dynamic inquiries. To address these challenges, we introduce Patient-Zero, a novel framework for ab initio patient simulation that requires no real medical records. Our Medically-Aligned Hierarchical Synthesis framework generates comprehensive and diverse patient records from abstract clinical guidelines via stratified attribute permutation. To support rigorous clinical interaction, we design a Dual-Track Cognitive Memory System that enables agents to dynamically update memory while preserving logical consistency and persona adherence. Extensive evaluations show that Patient-Zero establishes a new state of the art in both data quality and interaction fidelity. In human expert evaluations, senior licensed physicians judge our synthetic data to be statistically indistinguishable from real human-authored data and higher in clinical quality. Furthermore, a downstream medical reasoning model trained on our synthetic dataset shows substantial performance gains (MedQA +24.0%; MMLU +14.5%), demonstrating the practical utility of our framework.

Paper Structure

This paper contains 69 sections, 5 equations, 5 figures, 17 tables.

Figures (5)

  • Figure 1: Our Patient-Zero Paradigm. While conventional methods are constrained by the derivative nature of real-world data, such as privacy risks and distribution biases, our framework enables ab initio patient simulation. Instead of using sensitive medical records as seed, Patient-Zero constructs patient agents from scratch using medical knowledge, achieving zero privacy risk while maintaining clinical consistency throughout synthetic data generation and interactive simulation.
  • Figure 2: Overview of our Patient-Zero Hierarchical Synthesis Framework. The pipeline factorizes patient generation into a four-stage causal chain ($c \to \mathcal{O} \to \mathcal{B} \to \mathcal{S} \to \mathcal{E}$). Starting from an abstract disease concept ($c$), the system progressively expands medical details: I) Standardizing noisy knowledge into a structured outline; II) Sampling epidemiological attributes via constrained permutation; III) Evolving dynamic symptom trajectories; and IV) Generating granular, quantitative examination results. A global verify-and-regenerate mechanism (bottom bar) enforces strict validity via iterative self-correction at every stage to prevent error propagation down the causal chain.
  • Figure 3: The Dual-Track Cognitive Memory System. This module integrates static semantic memory ($\mathcal{M}_{sem}$) and dynamic episodic memory ($\mathcal{M}_{epi}$) to drive coherent dialogue. The NLI-Verifier acts as a logical gatekeeper, evaluating candidate responses ($r_t$) against atomic memory ($\mathcal{M}_t$). By regenerating on Contradictions ($\mathcal{C}$), preserving state on Entailments ($\mathcal{E}$), and expanding memory on Neutral ($\mathcal{N}$) information, this closed loop effectively resolves the Stability-Plasticity Dilemma.
  • Figure 4: Expert Evaluation Results. (a) Senior licensed physicians showed near-chance discrimination between real and synthetic records, with Patient-Zero judged more frequently as human-authored. (b) Experts rated our framework highest in overall clinical quality.
  • Figure 5: Holistic Epidemiological and Semantic Alignment. (a) Epidemiological Manifold Coverage: The tight overlap of t-SNE visualization in the high-dimensional semantic space indicates that our synthetic data captures the comprehensive epidemiological manifold without mode collapse. (b) Attribute-Level Prevalence Alignment: Prevalence comparison across three dimensions. Hollow circles ($\circ$) represent real-world baseline; solid circles ($\bullet$) represent Patient-Zero; the connecting lines indicate alignment gaps. Minimal absolute differences demonstrate that Patient-Zero faithfully reconstructs complex real-world demographic and behavioral profiles.
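
The gating rule described in the Figure 3 caption (regenerate on contradiction, preserve state on entailment, expand memory on neutral) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `gated_reply` function, the `Label` enum, the retry budget, the choice to store memory as plain strings, and the `toy_nli` stub that stands in for a real NLI model.

```python
from enum import Enum
from typing import Callable, List, Sequence


class Label(Enum):
    CONTRADICTION = "C"
    ENTAILMENT = "E"
    NEUTRAL = "N"


def gated_reply(
    semantic: Sequence[str],                     # static track, never modified
    episodic: List[str],                         # dynamic track, may grow
    generate: Callable[[List[str]], str],        # hypothetical response generator
    nli: Callable[[List[str], str], Label],      # hypothetical NLI verifier
    max_retries: int = 3,                        # assumed retry budget
) -> str:
    """Return a candidate reply that does not contradict memory."""
    for _ in range(max_retries):
        memory = list(semantic) + episodic       # combined memory M_t
        candidate = generate(memory)             # candidate response r_t
        label = nli(memory, candidate)
        if label is Label.CONTRADICTION:
            continue                             # reject and regenerate
        if label is Label.NEUTRAL:
            episodic.append(candidate)           # new fact: expand episodic memory
        return candidate                         # entailment leaves memory as-is
    raise RuntimeError("no consistent reply within the retry budget")


def toy_nli(memory: List[str], candidate: str) -> Label:
    """Toy stand-in for an NLI model: exact match or a literal 'not ' prefix."""
    if candidate in memory:
        return Label.ENTAILMENT
    if any(candidate == "not " + m or m == "not " + candidate for m in memory):
        return Label.CONTRADICTION
    return Label.NEUTRAL


# Usage: the first candidate contradicts memory and is rejected;
# the second is new information and is written to episodic memory.
semantic = ("fever",)
episodic: List[str] = []
candidates = iter(["not fever", "cough"])
reply = gated_reply(semantic, episodic, lambda m: next(candidates), toy_nli)
```

In this sketch the semantic track is passed as an immutable sequence so that only the episodic track can change, which mirrors the static/dynamic split in the caption; how the actual system stores and updates atomic memory is not specified in this section.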

Theorems & Definitions (1)

  • Definition D.1: Null Hypothesis for Distribution Alignment