Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning
John Wu, David Wu, Jimeng Sun
TL;DR
This work tackles explainability in automated ICD coding by moving beyond label attention toward dictionary learning. It develops AutoCodeDL, a framework that extracts sparse, interpretable dictionary features from dense PLM embeddings via $L_1$-based and SPINE sparse autoencoders, then maps these features to ICD codes through ablations of token embeddings. By integrating dictionary features with LAAT, the approach provides more faithful, mechanistic explanations and enables steering of model predictions, while introducing automated metrics and human-centric evaluations (coherence, distinctiveness, stop-word analyses). The study demonstrates improved explainability over baselines and reveals insights into the hidden meanings embedded in clinical text, offering a scalable path toward transparent AI in medical coding. Limitations include reconstruction gaps and feature dead zones, suggesting future work on causal/disentangled representations and larger, richer dictionaries to better match ICD granularity.
Abstract
Medical coding, the translation of unstructured clinical text into standardized medical codes, is a crucial but time-consuming healthcare practice. Though large language models (LLMs) could automate the coding process and improve the efficiency of such tasks, interpretability remains paramount for maintaining patient trust. Current efforts in interpretability of medical coding applications rely heavily on label attention mechanisms, which often highlight extraneous tokens irrelevant to the ICD code. To facilitate accurate interpretability in medical language models, this paper leverages dictionary learning, which can efficiently extract sparsely activated representations from dense language model embeddings in superposition. Compared with common label attention mechanisms, our model goes beyond token-level representations by building an interpretable dictionary that enhances the mechanistic explanations for each ICD code prediction, even when the highlighted tokens are medically irrelevant. We show that dictionary features can steer model behavior, elucidate the hidden meanings of upwards of 90% of medically irrelevant tokens, and are human interpretable.
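
To make the dictionary-learning idea concrete, the following is a minimal sketch of an $L_1$-penalized sparse autoencoder applied to dense token embeddings, together with a feature ablation of the kind used to probe what a dictionary feature contributes. It is illustrative only and not the authors' implementation: the function names (`sae_forward`, `sae_loss`, `ablate_feature`), the dimensions, the coefficient `l1_coeff`, and the random untrained weights are all assumptions made for this example.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode dense embeddings into non-negative features, then reconstruct."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps activations non-negative
    x_hat = f @ W_dec + b_dec               # linear decoder: rows of W_dec act as dictionary atoms
    return f, x_hat

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse activations."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

def ablate_feature(f, W_dec, b_dec, idx):
    """Zero one dictionary feature and decode, to see what that feature contributed."""
    f_abl = f.copy()
    f_abl[:, idx] = 0.0
    return f_abl @ W_dec + b_dec

# Hypothetical sizes: 5 token embeddings of width 16, a dictionary of 64 features.
rng = np.random.default_rng(0)
d_model, n_features, n_tokens = 16, 64, 5
x = rng.normal(size=(n_tokens, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f)
x_abl = ablate_feature(f, W_dec, b_dec, idx=0)
```

In this sketch, the difference `x_hat - x_abl` equals feature 0's activation times its decoder row for each token. In a trained model, passing the ablated embedding to the downstream classifier and measuring the change in an ICD code's predicted probability is one plausible way to associate a feature with a code.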
