Table of Contents
Fetching ...

PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Samah Fodeh, Linhai Ma, Yan Wang, Srivani Talakokkul, Ganesh Puthiaraju, Afshan Khan, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

TL;DR

PVminer is introduced, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication that achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines.

Abstract

Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, achieving F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.

PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

TL;DR

PVminer is introduced, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication that achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines.

Abstract

Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, achieving F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.
Paper Structure (37 sections, 7 equations, 4 figures, 10 tables)

This paper contains 37 sections, 7 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Overview of the PVminer framework. (1) A BERT baseline model is domain-adapted by pre-training on 6M unlabeled patient-authored messages using a masked language modeling objective, producing PV-BERT-base and PV-BERT-large. (2) A BERTopic model is trained on 500K unlabeled messages to learn latent thematic structure and generate topic keywords, producing PV-Topic-BERT. (3) During fine-tuning, PV-BERT models are augmented with topic cues from the pre-trained PV-Topic-BERT model and trained on annotated patient voice data to predict structured labels (Code, Subcode, Combo). (4) During inference, given a new message, topic cues are again retrieved from the PV-Topic-BERT model, and the fine-tuned PV-BERT produces the structured patient voice representation.
  • Figure 2: Distribution of Code categories across all annotated patient messages. Each message may contain multiple non-overlapping labels.
  • Figure 3: Distribution of Subcodes representing fine-grained PV categories.
  • Figure 4: Distribution of Combo labels (Code–Subcode pairs) in the annotated dataset. Absent Subcodes are labeled as None.