Governance-Ready Small Language Models for Medical Imaging: Prompting, Abstention, and PACS Integration
Yiting Wang, Ziwei Wang, Di Zhu, Jiachen Zhong, Weiyi Li
TL;DR
The paper tackles the governance and operational challenges of deploying small language models (SLMs) for medical-imaging utilities, specifically anteroposterior/posteroanterior (AP/PA) view tagging in chest radiographs. It proposes a prompt-first framework with three tiers, a decision-theoretic abstention policy, and standards-aware integration with PACS, DICOM, HL7 v2, and FHIR to ensure auditability and safety. Through illustrative experiments on NIH Chest X-ray data with four deployable SLMs, it demonstrates how prompt design and calibrated abstention affect accuracy, calibration (expected calibration error, ECE), and human oversight workload, while maintaining deterministic, reproducible I/O contracts. The work emphasizes an auditable evidence pack, a human-factors RACI, and a staged pathway from sandboxing to reader studies, aiming for practical, governance-ready deployment rather than diagnostic claims.
Abstract
Small Language Models (SLMs) are a practical option for narrow, workflow-relevant medical imaging utilities where privacy, latency, and cost dominate. We present a governance-ready recipe that combines prompt scaffolds, calibrated abstention, and standards-compliant integration into Picture Archiving and Communication Systems (PACS). Our focus is the assistive task of AP/PA view tagging for chest radiographs. Using four deployable SLMs (Qwen2.5-VL, MiniCPM-V, Gemma 7B, LLaVA 7B) on NIH Chest X-ray, we provide illustrative evidence: reflection-oriented prompts benefit lighter models, whereas stronger baselines are less sensitive. Beyond accuracy, we operationalize abstention, expected calibration error, and oversight burden, and we map outputs to DICOM tags, HL7 v2 messages, and FHIR ImagingStudy. The contribution is a prompt-first deployment framework, an operations playbook for calibration, logging, and change management, and a clear pathway from pilot utilities to reader studies without over-claiming clinical validation. We additionally specify a human-factors RACI, stratified calibration for dataset shift, and an auditable evidence pack to support local governance reviews.
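To make the three operational quantities named above concrete (abstention, expected calibration error, and oversight burden), the following is a minimal sketch, not the paper's implementation. It assumes a Chow-style cost rule for abstention (emit a tag only when the expected cost of an error does not exceed the cost of human review), equal-width confidence bins for ECE, and oversight burden measured as the fraction of studies deferred to a human. All function names, cost values, and the bin count are hypothetical.

```python
from typing import List, Optional, Sequence


def abstention_threshold(cost_error: float, cost_review: float) -> float:
    """Confidence threshold from a Chow-style rule (an assumed policy).

    Emit a tag only if (1 - p) * cost_error <= cost_review,
    which is equivalent to p >= 1 - cost_review / cost_error.
    """
    if cost_error <= 0 or cost_review < 0:
        raise ValueError("cost_error must be > 0 and cost_review >= 0")
    return max(0.0, 1.0 - cost_review / cost_error)


def decide(label: str, confidence: float, threshold: float) -> Optional[str]:
    """Return the predicted view tag, or None to abstain and defer to a human."""
    return label if confidence >= threshold else None


def oversight_burden(decisions: Sequence[Optional[str]]) -> float:
    """Fraction of studies routed to human review (abstentions)."""
    if not decisions:
        raise ValueError("decisions must be non-empty")
    return sum(1 for d in decisions if d is None) / len(decisions)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE with equal-width bins: sum over bins of (n_b / N) * |acc_b - conf_b|."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # A confidence of exactly 1.0 falls into the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical costs: a wrong AP/PA tag is 10x as costly as one human review.
    tau = abstention_threshold(cost_error=10.0, cost_review=1.0)
    predictions = [("PA", 0.97), ("AP", 0.62), ("PA", 0.91), ("AP", 0.88)]
    decisions = [decide(label, conf, tau) for label, conf in predictions]
    print("threshold:", tau)
    print("decisions:", decisions)
    print("oversight burden:", oversight_burden(decisions))
```

Raising the assumed cost of an error relative to the cost of review pushes the threshold up. That trades a higher oversight burden for fewer unreviewed mistakes, which is the kind of trade-off a local governance review would need to set explicitly.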
