Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning

John Wu, David Wu, Jimeng Sun

TL;DR

This work tackles explainability in automated ICD coding by moving beyond label attention toward dictionary learning. It develops AutoCodeDL, a framework that extracts sparse, interpretable dictionary features from dense PLM embeddings via $L_1$-based and SPINE sparse autoencoders, then maps these features to ICD codes through ablations of token embeddings. By integrating dictionary features with LAAT, the approach provides more faithful, mechanistic explanations and enables steering of model predictions, while introducing automated metrics and human-centric evaluations (coherence, distinctiveness, stop-word analyses). The study demonstrates improved explainability over baselines and reveals insights into the hidden meanings embedded in clinical text, offering a scalable path toward transparent AI in medical coding. Limitations include reconstruction gaps and feature dead zones, suggesting future work on causal/disentangled representations and larger, richer dictionaries to better match ICD granularity.

Abstract

Medical coding, the translation of unstructured clinical text into standardized medical codes, is a crucial but time-consuming healthcare practice. Though large language models (LLMs) could automate the coding process and improve the efficiency of such tasks, interpretability remains paramount for maintaining patient trust. Current efforts in interpretability of medical coding applications rely heavily on label attention mechanisms, which often highlight extraneous tokens that are irrelevant to the ICD code. To facilitate accurate interpretability in medical language models, this paper leverages dictionary learning, which can efficiently extract sparsely activated representations from dense language model embeddings in superposition. Compared with common label attention mechanisms, our model goes beyond token-level representations by building an interpretable dictionary that enhances the mechanistic explanations for each ICD code prediction, even when the highlighted tokens are medically irrelevant. We show that dictionary features can steer model behavior, elucidate the hidden meanings of upwards of 90% of medically irrelevant tokens, and are human interpretable.
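
As a rough illustration of the dictionary-learning idea described above, the following is a minimal NumPy sketch of an $L_1$-penalized sparse autoencoder forward pass. It is not the authors' implementation: the single-layer ReLU encoder, the randomly initialized weights, the dimensions, and the names `sae_forward`, `sae_loss`, and `l1_coeff` are all assumptions chosen for illustration, and no training loop is shown.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode dense embeddings into non-negative activations, then reconstruct.

    Assumed architecture (not taken from the paper): one linear layer plus
    ReLU as the encoder, one linear layer as the decoder.
    """
    f = np.maximum(0.0, x @ W_enc + b_enc)  # dictionary feature activations
    x_hat = f @ W_dec + b_dec               # reconstruction of the embedding
    return f, x_hat

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

# Toy data standing in for PLM token embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
x = rng.normal(size=(8, d_model))

W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f)

# Indices of the three most activated dictionary features for each token.
top_features = np.argsort(-f, axis=-1)[:, :3]
```

After training, the indices in `top_features` would play the role of dictionary feature IDs that can be looked up to explain a highlighted token. With untrained random weights, as here, they carry no meaning.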

Paper Structure

This paper contains 30 sections, 9 equations, 17 figures, 14 tables, 2 algorithms.

Figures (17)

  • Figure 1: Motivation: LAAT identifies the most relevant tokens for each ICD code (b). Based on our own inspection of which tokens are most relevant to an ICD code (a), we would assume that "and" is irrelevant to the ICD code prediction. Although "and" may appear to be highlighted irrelevantly, taking token embeddings out of superposition decomposes dense token embeddings into more semantically meaningful dictionary features. These features show that concepts of "failure of wound healing" are embedded within the token embedding of "and" (c), which justifies its highlighting by LAAT for a wound-related ICD code.
  • Figure 2: Building a dictionary involves several steps. In step 1, a sparse autoencoder decomposes each token embedding into a sparse latent space, where each nonzero element represents a dictionary feature ID. This process enables the creation of mappings between tokens, dictionary features, and ICD codes. In step 2, ICD codes are mapped to dictionary features based on the softmax probabilities of each ICD prediction after dictionary embedding ablations, as detailed in the method section. Once a dictionary is constructed, it is used to enhance explanations by applying it to the tokens highlighted by LAAT, as shown in Figure 3.
  • Figure 3: Proposed method for the automated ICD interpretability pipeline, AutoCodeDL. LAAT identifies the most important words, "and wound". The sparse autoencoder then queries the most activated dictionary features for those words and returns their dictionary feature IDs, which can be used to further explain the PLM's predictions and attention highlights.
  • Figure 4: UMAP of SPINE Embeddings: Dictionary features are interpretable in steering model behavior. Darker red indicates a higher maximum observed probability increase of the top medical code from a feature's exclusive clamping. Each dot is a dictionary feature embedding projected into 2D.
  • Figure 5: Label Attention identifies the most relevant tokens for each ICD code through a label attention matrix.
  • ...and 12 more figures
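
The ablation-based mapping described in the Figure 2 caption (step 2) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's procedure: the linear classifier `W_cls`, the decoder `W_dec`, the helper names `code_probs` and `ablation_effects`, and all sizes are hypothetical, and a single softmax over codes is used only because the caption mentions softmax probabilities.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def code_probs(embedding, W_cls):
    """Hypothetical stand-in for the coding model: linear layer plus softmax."""
    return softmax(embedding @ W_cls)

def ablation_effects(f, W_dec, W_cls):
    """effects[j, c] = drop in probability of code c when feature j is zeroed.

    f is the sparse activation vector of a single token; only active
    (nonzero) features are ablated, so inactive rows stay at zero.
    """
    base = code_probs(f @ W_dec, W_cls)
    effects = np.zeros((f.shape[0], W_cls.shape[1]))
    for j in np.flatnonzero(f):
        f_abl = f.copy()
        f_abl[j] = 0.0  # ablate one dictionary feature
        effects[j] = base - code_probs(f_abl @ W_dec, W_cls)
    return effects

# Toy setup with made-up sizes and random weights.
rng = np.random.default_rng(0)
n_features, d_model, n_codes = 12, 8, 5
W_dec = rng.normal(size=(n_features, d_model))
W_cls = rng.normal(size=(d_model, n_codes))

f = np.zeros(n_features)
f[[2, 5, 11]] = [1.0, 0.5, 2.0]  # three active dictionary features

effects = ablation_effects(f, W_dec, W_cls)

# For each active feature, the code whose probability drops most on ablation.
active = np.flatnonzero(f)
feature_to_code = {int(j): int(effects[j].argmax()) for j in active}
```

In this sketch, `feature_to_code` associates each active feature with the code it most supports; aggregating such effects over many tokens would give a feature-to-ICD mapping in the spirit of the caption.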