Medical Interpretability and Knowledge Maps of Large Language Models

Razvan Marinescu, Victoria-Elisabeth Gruber, Diego Fajardo

TL;DR

This paper studies how medical knowledge is represented inside large language models by applying four interpretability techniques to five open-source LLMs. It constructs LLM maps that locate knowledge about age, symptoms, diseases, drugs, and dosages by combining UMAP projections of activations, gradient saliency, layer lesioning, and activation patching. Key findings include non-linear age encoding with a discontinuity at age 18, circular disease-progression representations for several diseases, drug representations that cluster by medical specialty, and activation collapse at intermediate layers in Gemma/MedGemma. The results offer practical guidance for targeted fine-tuning, un-learning, and debiasing of medical LLMs, and illuminate layer-specific dynamics relevant to domain adaptation.

Abstract

We present a systematic study of medical-domain interpretability in Large Language Models (LLMs). We study how LLMs both represent and process medical knowledge through four different interpretability techniques: (1) UMAP projections of intermediate activations, (2) gradient-based saliency with respect to the model weights, (3) layer lesioning/removal, and (4) activation patching. We present knowledge maps of five LLMs which show, at a coarse resolution, where knowledge about patients' ages, medical symptoms, diseases, and drugs is stored in the models. In particular, for Llama3.3-70B, we find that most medical knowledge is processed in the first half of the model's layers. In addition, we find several interesting phenomena: (i) age is often encoded in a non-linear and sometimes discontinuous manner at intermediate layers in the models, (ii) the disease progression representation is non-monotonic and circular at certain layers of the model, (iii) in Llama3.3-70B, drugs cluster better by medical specialty than by mechanism of action, and (iv) Gemma3-27B and MedGemma-27B have activations that collapse at intermediate layers but recover by the final layers. These results can guide future research on fine-tuning, un-learning, or de-biasing LLMs for medical tasks by suggesting at which layers in the model these techniques should be applied.
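
To make techniques (3) and (4) concrete, the following is a minimal sketch on a toy residual network, not the authors' implementation. Everything in it is an assumption made for illustration: the helper names (`make_layers`, `forward`, `lesion_scores`, `patching_scores`), the tanh blocks standing in for transformer layers, and the Euclidean distance used as the effect measure. Lesioning drops one layer's update and measures how far the output moves; patching copies one layer's update from a "clean" run into a "corrupted" run and measures how much of the output gap is recovered.

```python
import math
import random


def make_layers(n_layers, dim, seed=0):
    """Random weight matrices for a toy residual stack (hypothetical stand-in for an LLM)."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]
            for _ in range(n_layers)]


def block(W, h):
    """One toy layer: the update tanh(W h) that is added to the residual stream."""
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]


def forward(layers, x, skip=None, patch=None):
    """Run the stack and return (final hidden state, list of per-layer updates).

    skip:  index of a layer to lesion (its update is replaced by zeros).
    patch: (layer_index, update) to overwrite that layer's update with a cached one.
    """
    h = list(x)
    updates = []
    for i, W in enumerate(layers):
        if skip == i:
            u = [0.0] * len(h)
        elif patch is not None and patch[0] == i:
            u = list(patch[1])
        else:
            u = block(W, h)
        updates.append(u)
        h = [a + b for a, b in zip(h, u)]
    return h, updates


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def lesion_scores(layers, x):
    """Output change caused by lesioning each layer in turn."""
    base, _ = forward(layers, x)
    return [dist(base, forward(layers, x, skip=i)[0]) for i in range(len(layers))]


def patching_scores(layers, clean_x, corrupt_x):
    """Fraction of the clean/corrupted output gap recovered by patching each layer.

    Assumes the clean and corrupted outputs differ (gap > 0).
    """
    clean_out, clean_updates = forward(layers, clean_x)
    corrupt_out, _ = forward(layers, corrupt_x)
    gap = dist(clean_out, corrupt_out)
    scores = []
    for i in range(len(layers)):
        patched_out, _ = forward(layers, corrupt_x, patch=(i, clean_updates[i]))
        scores.append((gap - dist(clean_out, patched_out)) / gap)
    return scores


if __name__ == "__main__":
    layers = make_layers(n_layers=6, dim=4)
    clean, corrupt = [1.0, 0.5, -0.5, 0.2], [0.1, -0.8, 0.3, 0.9]
    print("lesion effect per layer:  ", lesion_scores(layers, clean))
    print("patching effect per layer:", patching_scores(layers, clean, corrupt))
```

In this toy setting, layers with a large lesion score or a large patching score are the ones the output depends on most. A per-layer profile of this kind is the sort of signal that could be aggregated into a coarse map of where information is processed, although the paper's own aggregation procedure is not reproduced here.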

Paper Structure

This paper contains 11 sections, 1 equation, 24 figures, 5 tables.

Figures (24)

  • Figure 1: Overview of our Medical LLM Interpretability study, outlining the process to build LLM maps.
  • Figure 2: LLM Map for Llama3.3-70B showing where medical knowledge about age, symptoms, diseases, drugs and drug dosages is stored in the model. Each interval shown is estimated quantitatively from four types of analyses: clustering structure in UMAP embeddings, high weight saliency values, high degradation upon layer lesioning/ablation and high patching effect in activation patching.
  • Figure 3: UMAP analysis using subjects of different ages in Llama3.3-70B. The age manifold shows non-linearities throughout many intermediate layers, as well as a discontinuity between subjects younger than 18 and those 18 or older. Prompts used are shown at the bottom of the figure. The bottom-right plot confirms that the only discontinuity is at age 18.
  • Figure 4: Disease Progression UMAP in Llama3.3-70B. The model shows circular, non-monotonic disease progression, in particular for Alzheimer's disease (stages 2 and 4 are closest to the final stage at many layers) and Parkinson's disease (stage 7 is closest to the first stage at most layers). Details of each stage, as prompted to the model, are shown in the Appendix table of disease progression prompts.
  • Figure 5: Drug UMAP embeddings colored by mechanism of action in Llama3.3-70B.
  • ...and 19 more figures