Governance-Ready Small Language Models for Medical Imaging: Prompting, Abstention, and PACS Integration

Yiting Wang, Ziwei Wang, Di Zhu, Jiachen Zhong, Weiyi Li

TL;DR

The paper tackles the governance and operational challenges of deploying small language models (SLMs) for medical-imaging utilities, specifically AP/PA view tagging of chest radiographs. It proposes a three-tier, prompt-first framework, a decision-theoretic abstention policy, and standards-aware PACS integration through DICOM, HL7 v2, and FHIR to support auditability and safety. Through illustrative experiments on NIH Chest X-ray data with four deployable SLMs, it shows how prompt design and calibrated abstention affect accuracy, calibration (expected calibration error, ECE), and human oversight workload, while keeping I/O contracts deterministic and reproducible. The work emphasizes an auditable evidence pack, a human-factors RACI, and a staged pathway from sandboxing to reader studies, aiming for practical, governance-ready deployment rather than diagnostic claims.

Abstract

Small Language Models (SLMs) are a practical option for narrow, workflow-relevant medical imaging utilities where privacy, latency, and cost dominate. We present a governance-ready recipe that combines prompt scaffolds, calibrated abstention, and standards-compliant integration into Picture Archiving and Communication Systems (PACS). Our focus is the assistive task of AP/PA view tagging for chest radiographs. Using four deployable SLMs (Qwen2.5-VL, MiniCPM-V, Gemma 7B, LLaVA 7B) on NIH Chest X-ray, we provide illustrative evidence: reflection-oriented prompts benefit lighter models, whereas stronger baselines are less sensitive. Beyond accuracy, we operationalize abstention, expected calibration error, and oversight burden, and we map outputs to DICOM tags, HL7 v2 messages, and FHIR ImagingStudy. The contribution is a prompt-first deployment framework, an operations playbook for calibration, logging, and change management, and a clear pathway from pilot utilities to reader studies without over-claiming clinical validation. We additionally specify a human-factors RACI, stratified calibration for dataset shift, and an auditable evidence pack to support local governance reviews.
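To make the idea of cost-based abstention concrete, here is a minimal sketch of one common decision-theoretic rule (a Chow-style reject option). It is not taken from the paper: the names `CostProfile`, `abstention_threshold`, and `decide`, the two-cost model, and the example cost values are all assumptions made for illustration. The rule tags a study only when the expected cost of a wrong tag is no greater than the cost of sending the study to a human reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostProfile:
    """Hypothetical site-specific costs in arbitrary units (an assumption)."""
    error_cost: float   # cost of an incorrect AP/PA tag being written
    review_cost: float  # cost of routing one study to a human reviewer


def abstention_threshold(costs: CostProfile) -> float:
    """Confidence at or above which tagging is no costlier than review."""
    if costs.error_cost <= 0 or costs.review_cost < 0:
        raise ValueError("error_cost must be > 0 and review_cost >= 0")
    # Tag when (1 - conf) * error_cost <= review_cost,
    # i.e. conf >= 1 - review_cost / error_cost.
    return max(0.0, 1.0 - costs.review_cost / costs.error_cost)


def decide(p_ap: float, costs: CostProfile) -> str:
    """Return 'AP', 'PA', or 'ABSTAIN' from a calibrated probability of AP."""
    if not 0.0 <= p_ap <= 1.0:
        raise ValueError("p_ap must be a probability in [0, 1]")
    label, conf = ("AP", p_ap) if p_ap >= 0.5 else ("PA", 1.0 - p_ap)
    expected_error_cost = (1.0 - conf) * costs.error_cost
    if expected_error_cost > costs.review_cost:
        return "ABSTAIN"  # cheaper to ask a human than to risk a wrong tag
    return label


if __name__ == "__main__":
    site = CostProfile(error_cost=10.0, review_cost=1.0)
    for p in (0.95, 0.70, 0.03):
        print(p, decide(p, site))
```

The rule is only meaningful if `p_ap` is reasonably calibrated, which is why calibration is measured alongside accuracy. Raising `review_cost` relative to `error_cost` lowers the threshold and reduces the oversight burden at the price of more automated errors.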

Paper Structure

This paper contains 10 sections, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Risk–utility frontier under site-specific cost profiles (illustrative).
  • Figure 2: Tier ablation (illustrative). Reflection benefits lighter SLMs; stronger baselines are less sensitive.
  • Figure 3: Confidence–coverage curves under abstention sweep (illustrative).
  • Figure 4: Reliability diagram (illustrative). Deviation from the diagonal informs ECE.
  • Figure 5: Pilot trend: reflection helps lighter models; stronger baselines are less sensitive.
  • ...and 2 more figures
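The quantities behind Figures 3 and 4 can be sketched in a few lines. The code below is an illustrative assumption rather than the paper's implementation: it uses the standard equal-width binned estimate of ECE (bins are half-open `[lo, hi)`, with the last bin closed), and a simple threshold sweep for coverage versus selective accuracy. The function names and the toy data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width binned ECE: sum over bins of weight * |accuracy - confidence|."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        counts[idx] += 1
        conf_sums[idx] += conf
        hit_sums[idx] += int(ok)
    ece = 0.0
    for count, conf_sum, hits in zip(counts, conf_sums, hit_sums):
        if count:
            ece += (count / n) * abs(hits / count - conf_sum / count)
    return ece


def coverage_and_accuracy(
    confidences: Sequence[float], correct: Sequence[bool], threshold: float
) -> Tuple[float, Optional[float]]:
    """Fraction of cases answered at this threshold, and accuracy on those cases."""
    kept: List[bool] = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else None  # None: model abstains on all
    return coverage, accuracy


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    conf = [0.95, 0.95, 0.65, 0.65]
    hit = [True, True, True, False]
    print("ECE:", expected_calibration_error(conf, hit))
    for t in (0.5, 0.9):
        print("threshold", t, coverage_and_accuracy(conf, hit, t))
```

Sweeping `threshold` from 0.5 to 1.0 traces a confidence-coverage curve; the per-bin gap between accuracy and mean confidence is what a reliability diagram plots as deviation from the diagonal.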