Linguistic Blind Spots in Clinical Decision Extraction

Mohamed Elgaar, Hadi Amiri

TL;DR

Problem: automatic extraction of medical decisions from clinical notes is impeded by systematic linguistic variation across decision types. Approach: analyze MedDec discharge summaries annotated with DICTUM categories using seven linguistic indices, and evaluate a fixed RoBERTa-based span extractor under exact-match and relaxed IoU criteria. Findings: entity-dense, telegraphic decision spans (drug-related and problem-defining) are extracted more reliably than narrative, advice-like spans (advice and precaution). Exact-match recall is 48%, with a 34-point drop (58% to 24%) from the lowest to the highest stopword-proportion bin; a relaxed IoU criterion raises recall to 71%, which points to boundary errors as a major source of failures. Implications: the results motivate boundary-tolerant evaluation and targeted improvements for narrative-style decisions, and they raise deployment considerations, including potential demographic variation in documentation.

Abstract

Extracting medical decisions from clinical notes is a key step for clinical decision support and patient-facing care summaries. We study how the linguistic characteristics of clinical decisions vary across decision categories and whether these differences explain extraction failures. Using MedDec discharge summaries annotated with decision categories from the Decision Identification and Classification Taxonomy for Use in Medicine (DICTUM), we compute seven linguistic indices for each decision span and analyze the span-level extraction recall of a standard transformer model. We find clear category-specific signatures: drug-related and problem-defining decisions are entity-dense and telegraphic, whereas advice and precaution decisions are more narrative, with higher stopword and pronoun proportions and more frequent hedging and negation cues. On the validation split, exact-match recall is 48%, with large gaps across linguistic strata: recall drops from 58% in the lowest stopword-proportion bin to 24% in the highest, and spans containing hedging or negation cues are less likely to be recovered. Under a relaxed overlap-based match criterion, recall increases to 71%, indicating that many errors are span boundary disagreements rather than complete misses. Overall, narrative-style spans, which are common in advice and precaution decisions, are a consistent blind spot under exact matching, suggesting that downstream systems should incorporate boundary-tolerant evaluation and extraction strategies for clinical decisions.
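
To make the difference between the two match criteria concrete, the following is a minimal sketch of span-level recall under exact matching and under a relaxed IoU threshold. It is an illustration rather than the paper's evaluation code: the span representation (character offsets with an exclusive end), the helper names `iou` and `span_recall`, and the requirement that exact matches also agree on category are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical span representation: (start_offset, end_offset_exclusive, category).
Span = Tuple[int, int, str]


def iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two spans measured in character offsets."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def span_recall(gold: List[Span], pred: List[Span],
                iou_threshold: Optional[float] = None) -> float:
    """Fraction of gold spans recovered by at least one predicted span.

    With iou_threshold=None a gold span counts as recovered only if a
    prediction of the same category has identical boundaries (exact match).
    Otherwise any same-category prediction with IoU >= iou_threshold counts.
    """
    if not gold:
        return 0.0
    hits = 0
    for g in gold:
        # Assumption: matching is restricted to predictions of the same category.
        candidates = [p for p in pred if p[2] == g[2]]
        if iou_threshold is None:
            matched = any(p[0] == g[0] and p[1] == g[1] for p in candidates)
        else:
            matched = any(iou(g, p) >= iou_threshold for p in candidates)
        hits += int(matched)
    return hits / len(gold)


# Toy example: the second prediction starts five characters late.
gold_spans = [(0, 10, "drug"), (20, 40, "advice")]
pred_spans = [(0, 10, "drug"), (25, 40, "advice")]

exact_recall = span_recall(gold_spans, pred_spans)
relaxed_recall = span_recall(gold_spans, pred_spans, iou_threshold=0.5)
```

In this toy case the late-starting advice prediction is a miss under exact matching but a hit under the relaxed criterion, which is the kind of boundary disagreement that separates the two reported recall figures.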

Paper Structure (19 sections, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Z-scored mean linguistic indices by decision category. For each index, values are z-scored across all plotted spans as $z=(x-\mu)/\sigma$ before computing category means. Red indicates above-average values; blue indicates below-average. Advice and precaution decisions show elevated hedging, negation, and pronoun use.
  • Figure 2: Linguistic index distributions by decision category. Error bars indicate 95% cluster-bootstrap confidence intervals. Categories are ordered by mean FKGL. Binary indices (hedge/negation presence) report the proportion of spans containing the marker.
  • Figure 3: Span-level extraction recall by linguistic index bins (exact match). Dashed line indicates overall recall (48%). Error bars show 95% cluster-bootstrap confidence intervals. Higher stopword proportions are strongly associated with lower recall.
  • Figure 4: Span-level extraction recall by linguistic index bins under a relaxed overlap-based match criterion (IoU $\ge 0.5$ within category). Dashed line indicates overall recall (71%).
  • Figure 5: Z-scored linguistic indices by patient demographics. Documentation for non-English-speaking and female patients shows lower entity density and more narrative style.
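
The z-scoring described in the Figure 1 caption can be sketched as follows: each index is standardized across all spans with $z=(x-\mu)/\sigma$, and the z-scores are then averaged within each decision category. This is a minimal illustration, not the authors' code. The function name `zscored_category_means`, the toy values, and the use of the population standard deviation for $\sigma$ are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List


def zscored_category_means(values: List[float],
                           categories: List[str]) -> Dict[str, float]:
    """Z-score one linguistic index across all spans, then average per category."""
    if len(values) != len(categories):
        raise ValueError("values and categories must have the same length")
    mu = mean(values)
    # Assumption: population standard deviation; the caption does not say which.
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("index is constant across spans; z-score is undefined")
    by_category = defaultdict(list)
    for x, category in zip(values, categories):
        by_category[category].append((x - mu) / sigma)
    return {category: mean(zs) for category, zs in by_category.items()}


# Toy stopword proportions for four spans (illustrative values only).
stopword_prop = [0.2, 0.4, 0.6, 0.8]
span_category = ["drug", "drug", "advice", "advice"]

category_means = zscored_category_means(stopword_prop, span_category)
```

Because the standardization is done over all spans before grouping, a positive category mean indicates that the category sits above the overall average for that index, and a negative mean indicates that it sits below.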