
SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

Joseph Breda, Fadi Yousif, Beszel Hawkins, Marinela Cotoi, Miao Liu, Ray Luo, Po-Hsuan Cameron Chen, Mike Schaekermann, Samuel Schmidgall, Xin Liu, Girish Narayanswamy, Samuel Solomon, Maxwell A. Xu, Xiaoran Fan, Longfei Shangguan, Anran Wang, Bhavna Daryani, Buddy Herkenham, Cara Tan, Mark Malhotra, Shwetak Patel, John B. Hernandez, Quang Duong, Yun Liu, Zach Wasson, Dimitrios Antos, Bob Lou, Matthew Thompson, Jonathan Richina, Anupam Pathak, Nichole Young-Lin, Jake Sunshine, Daniel McDuff

Abstract

Language models excel at diagnostic assessments on curated medical case studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context, making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with five AI agents. This corpus captures diverse communication and a realistic distribution of illnesses from a real-world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians during over 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.47, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Moreover, agentic strategies that conduct a dedicated symptom interview, eliciting additional symptom information before providing a diagnosis, perform substantially better than baseline, user-guided conversations (p < 0.001). An auxiliary analysis of 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview compared to a user-guided symptom discussion, which is the default of most consumer LLMs.


Paper Structure

This paper contains 30 sections, 16 figures, 4 tables.

Figures (16)

  • Figure 1: SymptomAI Study. (a-b) Experimental deployment study procedure of SymptomAI for end-to-end patient interviewing and generative AI differential diagnosis (DDx) for symptom assessment, benchmarked against study participant-reported diagnoses received from a Health Care Provider (HCP). (c-d) This led to a large dataset (N=13,917) of naturalistic symptom conversations communicated by laypeople paired with recent wearable data. (e) We leveraged clinical expert annotation to validate SymptomAI DDx and to inform the development of an LLM verifier (i.e., auto-rater) for expanding validation beyond the clinical evaluation sub-sample. (f) Leveraging SymptomAI as a phenotype labeler enables phenome-wide analysis of biosignals across the study population.
  • Figure 2: Clinical evaluation and user engagement of SymptomAI. (a) Proportion of SymptomAI and clinician DDx (normalized by category total) ranked by blinded clinicians in 1st, 2nd, and 3rd position among a randomized list of three candidate DDx lists for each conversation (one SymptomAI and two clinician baselines per trial). (b) The average top-5 accuracy assigned by clinicians to DDx produced by clinicians and SymptomAI. (c) Top-5 accuracy of clinicians and SymptomAI stratified by conversation prompting strategy. (d) Total user words across all user messages for each prompting strategy. Horizontal line denotes median word count. (e) The top-5 accuracy assigned by clinicians to SymptomAI and baseline clinician-generated DDx, stratified by clinicians' confidence that the conversation contained enough information to support a plausibly accurate DDx. (f) The top-5 accuracy assigned by clinicians to SymptomAI and baseline clinician-generated DDx, stratified by clinicians' rating of confidence in their own DDx for that conversation.
  • Figure 3: Phenome-wide Association Study to explore the relation of wearable biosignals and AI-generated diagnoses. All phenome-wide analyses were performed using multiple logistic regression models adjusted for age, sex, and weight and included biosignals averaged in a recent and historic window to capture temporality. The Bonferroni significance threshold per biosignal (ranging from $p < 2.2 \times 10^{-4}$ to $p < 2.6 \times 10^{-4}$) is indicated by a red line and a p-value of 0.05 is indicated by the blue line. Diamond points indicate associations driven primarily by the recent biosignal window (acute) while circular points indicate associations driven by the historic biosignal window (chronic).
  • Figure 4: Heatmap of significant relationships between top diagnoses and wearable metrics. Odds ratios across all diagnoses and Fitbit-derived metrics that have at least one significant relationship. A heatmap of -log10(P) is overlaid on a table of significant associations between all incident phenotypes and Fitbit-derived metrics. Odds-ratio values (95% CI) are reported within each heatmap cell. Empty cells indicate insufficient data to fit a logistic regression for the given intersection.
  • Figure 5: Biosignal trends for a cohort of participants diagnosed with a respiratory infection, relative to the time of their SymptomAI conversations. Trends in selected wearable biosignals in the days leading up to a SymptomAI conversation, relative to a historic average from a 2-week baseline period starting 30 days before the conversation, for the infected and baseline cohorts. The infected cohort includes participants whom SymptomAI diagnosed with a respiratory infection, while the baseline includes all other participants in our dataset. Day 0 (dotted line) denotes the date of the SymptomAI conversation.
  • ...and 11 more figures
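The phenome-wide analysis described in the Figure 3 caption (one multiple logistic regression per diagnosis-biosignal pair, adjusted for age, sex, and weight, with a per-biosignal Bonferroni threshold) can be sketched as below. This is a minimal illustration on synthetic data, not the paper's actual pipeline: the variable names, effect sizes, and test count (~225 phenotypes, giving 0.05/225 ≈ 2.2 × 10⁻⁴, within the threshold range quoted in Figure 3) are assumptions for the sake of the example.

```python
import numpy as np
from scipy import stats

def logistic_fit(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson.
    Returns coefficients (intercept first) and Wald p-values."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)       # guard against exp overflow
        p = 1.0 / (1.0 + np.exp(-eta))
        W = p * (1.0 - p)
        H = X.T @ (X * W[:, None])             # observed information matrix
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    se = np.sqrt(np.diag(np.linalg.inv(H)))
    pvals = 2.0 * stats.norm.sf(np.abs(beta / se))
    return beta, pvals

# Synthetic example: does recent resting heart rate predict an influenza
# label after adjusting for age, sex, and weight? (all values illustrative)
rng = np.random.default_rng(0)
n = 5000
age = rng.normal(40, 12, n)
sex = rng.integers(0, 2, n).astype(float)
weight = rng.normal(75, 15, n)
rhr_recent = rng.normal(65, 8, n)
true_logit = -6.0 + 0.08 * rhr_recent + 0.01 * age
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

beta, pvals = logistic_fit(np.column_stack([rhr_recent, age, sex, weight]), y)
or_rhr = np.exp(beta[1])  # odds ratio per 1-bpm increase in recent RHR

# Per-biosignal Bonferroni threshold, assuming ~225 phenotype tests
bonferroni = 0.05 / 225   # ≈ 2.2e-4
is_significant = pvals[1] < bonferroni
```

In the paper's design, each biosignal enters the model twice, averaged over a recent and a historic window, so that significant coefficients can be attributed to acute versus chronic effects; the sketch above collapses this to a single recent-window term for brevity.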