Medical Interpretability and Knowledge Maps of Large Language Models
Razvan Marinescu, Victoria-Elisabeth Gruber, Diego Fajardo
TL;DR
This paper studies how medical knowledge is represented inside large language models by applying four interpretability techniques to five open-source LLMs. It constructs LLM maps that locate knowledge about age, symptoms, diseases, drugs, and dosages by combining UMAP projections of activations, gradient saliency, layer lesioning, and activation patching. Key findings include non-linear age encoding with a discontinuity at 18 years, circular representations of disease progression for several diseases, drug representations that cluster by medical specialty, and occasional activation collapse in Gemma/MedGemma. The results offer practical guidance for targeted fine-tuning, un-learning, and de-biasing of medical LLMs, and shed light on layer-specific dynamics relevant to domain adaptation.
Abstract
We present a systematic study of medical-domain interpretability in Large Language Models (LLMs). We study how LLMs both represent and process medical knowledge using four interpretability techniques: (1) UMAP projections of intermediate activations, (2) gradient-based saliency with respect to the model weights, (3) layer lesioning/removal, and (4) activation patching. We present knowledge maps of five LLMs which show, at a coarse resolution, where knowledge about patients' ages, medical symptoms, diseases, and drugs is stored in the models. For Llama3.3-70B in particular, we find that most medical knowledge is processed in the first half of the model's layers. In addition, we find several interesting phenomena: (i) age is often encoded in a non-linear and sometimes discontinuous manner at intermediate layers, (ii) the representation of disease progression is non-monotonic and circular at certain layers, (iii) drugs cluster better by medical specialty than by mechanism of action, especially in Llama3.3-70B, and (iv) Gemma3-27B and MedGemma-27B have activations that collapse at intermediate layers but recover by the final layers. These results can guide future research on fine-tuning, un-learning, or de-biasing LLMs for medical tasks by suggesting the layers at which these techniques should be applied.
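To make technique (3) concrete, the following is a minimal sketch of layer lesioning on a toy residual stack, not the procedure used in the paper. Everything in it is an assumption made for illustration: the function names (`make_layers`, `forward`, `lesion_scores`), the hand-picked per-layer scales, and the use of an L2 distance on the final output as the effect measure. The scales are deliberately chosen so that early layers matter more; the numbers are not results.

```python
import math


def make_layers():
    # Hypothetical toy "layers": each adds a scaled tanh of the state to the
    # state itself (a residual update). Scales are hand-picked, not learned.
    scales = [0.5, 1.0, 0.1, 0.05]
    return [lambda x, s=s: [xi + s * math.tanh(xi) for xi in x] for s in scales]


def forward(layers, x, skip=None):
    """Run the stack; the layer at index `skip` is lesioned (treated as identity)."""
    for i, layer in enumerate(layers):
        if i != skip:
            x = layer(x)
    return x


def l2(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def lesion_scores(layers, x):
    """For each layer, measure how far the output moves when that layer is removed."""
    baseline = forward(layers, x)
    return [l2(baseline, forward(layers, x, skip=i)) for i in range(len(layers))]


layers = make_layers()
scores = lesion_scores(layers, [0.3, -1.2])
for i, score in enumerate(scores):
    print(f"layer {i}: output shift when lesioned = {score:.3f}")
```

Because the update is residual, removing a layer leaves the input and output shapes compatible, which is what makes this kind of lesioning possible in transformer-style models. In a real LLM the effect measure would more plausibly be a task metric (for example, the change in accuracy or loss on medical prompts) rather than a raw output distance.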
