Tracing Pharmacological Knowledge In Large Language Models

Basil Hasan Khwaja; Dylan Chen; Guntas Toor; Anastasiya Kuznetsova

TL;DR

This study provides the first systematic mechanistic analysis of pharmacological knowledge in LLMs, offering insight into how biomedical semantics are encoded in these models.

Abstract

Large language models (LLMs) have shown strong empirical performance across pharmacology and drug discovery tasks, yet the internal mechanisms by which they encode pharmacological knowledge remain poorly understood. In this work, we investigate how drug-group semantics are represented and retrieved within Llama-based biomedical language models using causal and probing-based interpretability methods. We apply activation patching to localize where drug-group information is stored across model layers and token positions, and complement this analysis with linear probes trained on token-level and sum-pooled activations. Our results demonstrate that early layers play a key role in encoding drug-group knowledge, with the strongest causal effects arising from intermediate tokens within the drug-group span rather than from the final drug-group token. Linear probing further reveals that pharmacological semantics are distributed across tokens and are already present in the embedding space: token-level probes perform near chance, while probes on sum-pooled representations achieve maximal accuracy. Together, these findings suggest that drug-group semantics in LLMs are not localized to single tokens but instead arise from distributed representations. This study provides the first systematic mechanistic analysis of pharmacological knowledge in LLMs, offering insight into how biomedical semantics are encoded in these models.
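To make the activation-patching procedure concrete, here is a minimal pure-Python sketch under explicit assumptions: a hypothetical two-layer toy model stands in for a Llama model, and every name (`layer_update`, `run`, `logit_diff`, `patching_effect`) is illustrative rather than the authors' code. A clean and a counterfactual prompt differ at one token; for each (layer, position) the counterfactual run is repeated with that single residual-stream entry overwritten by its clean value, and the recovered logit difference is normalized as (LD_patched - LD_cf) / (LD_clean - LD_cf). That normalization is a common convention assumed here; this section does not state the paper's exact metric.

```python
# Toy activation patching over (layer, position). Everything here is a hypothetical
# stand-in for a real transformer; it only illustrates the patching procedure.
from typing import List, Optional, Tuple

Vector = List[float]
Stream = List[Vector]  # residual stream: one vector per token position


def layer_update(resid: Stream, weight: float) -> Stream:
    """Hypothetical layer: each position adds a scaled causal mean of positions <= itself."""
    out: Stream = []
    for i, vec in enumerate(resid):
        prefix = resid[: i + 1]
        mean = [sum(col) / len(prefix) for col in zip(*prefix)]
        out.append([v + weight * m for v, m in zip(vec, mean)])
    return out


def run(embeddings: Stream, weights: List[float],
        patch: Optional[Tuple[int, int, List[Stream]]] = None) -> Tuple[Stream, List[Stream]]:
    """Forward pass returning (final stream, cache).

    cache[l] is the stream entering layer l; cache[len(weights)] is the final output.
    patch=(layer, pos, clean_cache) overwrites position `pos` with its clean value
    at that layer, then lets the rest of the forward pass run unchanged.
    """
    resid = [list(v) for v in embeddings]
    cache: List[Stream] = []
    for layer in range(len(weights) + 1):
        if patch is not None and patch[0] == layer:
            _, pos, clean_cache = patch
            resid[pos] = list(clean_cache[layer][pos])
        cache.append([list(v) for v in resid])
        if layer < len(weights):
            resid = layer_update(resid, weights[layer])
    return resid, cache


def logit_diff(final: Stream, unembed: List[Vector], correct: int, wrong: int) -> float:
    """logit[correct] - logit[wrong], read from the last token position."""
    logits = [sum(w * x for w, x in zip(row, final[-1])) for row in unembed]
    return logits[correct] - logits[wrong]


def patching_effect(clean: Stream, counterfactual: Stream, weights: List[float],
                    unembed: List[Vector], correct: int, wrong: int) -> List[List[float]]:
    """Normalized effect per (layer, position): 0 = no recovery, 1 = full clean logit diff."""
    clean_final, clean_cache = run(clean, weights)
    cf_final, _ = run(counterfactual, weights)
    ld_clean = logit_diff(clean_final, unembed, correct, wrong)
    ld_cf = logit_diff(cf_final, unembed, correct, wrong)
    if ld_clean == ld_cf:
        raise ValueError("clean and counterfactual logit differences must differ")
    grid = []
    for layer in range(len(weights) + 1):
        row = []
        for pos in range(len(clean)):
            patched, _ = run(counterfactual, weights, patch=(layer, pos, clean_cache))
            ld = logit_diff(patched, unembed, correct, wrong)
            row.append((ld - ld_cf) / (ld_clean - ld_cf))
        grid.append(row)
    return grid


# Hypothetical 3-token prompts that differ only at position 1 (the "drug-group" token).
CLEAN = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
COUNTERFACTUAL = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]
WEIGHTS = [0.5, 0.5]                # two toy layers
UNEMBED = [[0.0, 1.0], [0.5, 0.0]]  # rows = answer tokens 0 and 1

grid = patching_effect(CLEAN, COUNTERFACTUAL, WEIGHTS, UNEMBED, correct=0, wrong=1)
for layer, row in enumerate(grid):
    print(layer, [round(e, 2) for e in row])
```

In this toy, patching the differing token at the embedding layer restores the clean logit difference fully, while at the final layer only the last position matters, because the answer logits are read from it.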

Paper Structure (18 sections, 1 equation, 4 figures, 10 tables)

Introduction
Methods
  Dataset construction for benchmarking and activation patching
  Activation patching
  Linear probing
Results
  Biomedical and general-purpose LLMs encode drug class–name relationships
  Activation patching of the residual stream of Llama-based models
  Activation patching of the MLP layers of Llama-based models
  Semantic representations are distributed across tokens for Llama-based models
Prior Work
  Mechanistic interpretability
  Mechanistic interpretability for biomedical LLMs
Limitations
Conclusions
...and 3 more sections

Figures (4)

Figure 1: Logit difference distributions for clean (blue) and counterfactual (orange) prompts across Llama-based models.
Figure 2: Activation patching of Llama-3.1-8B-Instruct on the random prompt.
Figure 4: Activation patching of Llama-3.1-8B-Instruct on the random prompt.
Figure 6: Metrics of linear probes trained on activations extracted from Llama-based models. Blue and orange lines indicate train and test metrics for the probe trained on individual token activations, while the red line shows both train and test performance for the probe trained on sum-pooled token activations.
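Figure 6 contrasts probes trained on individual token activations with probes trained on sum-pooled tokens. The sketch below shows how the two feature sets differ, under stated assumptions: activations are synthetic 2-D vectors from a hypothetical `make_spans` generator, a plain perceptron stands in for the authors' linear probe, and only training accuracy is reported (a real evaluation would also score a held-out split, as Figure 6 does). It is not meant to reproduce the paper's numbers.

```python
# Sketch of token-level vs sum-pooled linear probes on synthetic activations.
# A perceptron stands in for whatever linear probe the authors trained, and
# make_spans is a hypothetical data generator, not the paper's dataset.
import random
from typing import List, Tuple

Vector = List[float]


def sum_pool(span: List[Vector]) -> Vector:
    """Element-wise sum of the token activations in one span."""
    return [sum(col) for col in zip(*span)]


def train_perceptron(xs: List[Vector], ys: List[int], max_epochs: int = 200) -> Tuple[Vector, float]:
    """Plain perceptron with a bias term; labels are 0/1."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            sign = 1 if y == 1 else -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if sign * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + sign * xi for wi, xi in zip(w, x)]
                b += sign
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return w, b


def accuracy(w: Vector, b: float, xs: List[Vector], ys: List[int]) -> float:
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


def make_spans(n: int, span_len: int = 3, dim: int = 2,
               margin: float = 0.5, seed: int = 0) -> Tuple[List[List[Vector]], List[int]]:
    """Synthetic spans whose label depends only on the SUM of dimension 0 over all tokens."""
    rng = random.Random(seed)
    spans, labels = [], []
    while len(spans) < n:
        span = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(span_len)]
        total = sum(tok[0] for tok in span)
        if abs(total) < margin:  # keep a margin so pooled features are linearly separable
            continue
        spans.append(span)
        labels.append(1 if total > 0 else 0)
    return spans, labels


spans, labels = make_spans(200)
# Token-level probe: each token is a separate example carrying its span's label.
tok_x = [tok for span in spans for tok in span]
tok_y = [y for span, y in zip(spans, labels) for _ in span]
# Sum-pooled probe: one feature vector per span.
pool_x = [sum_pool(span) for span in spans]

w_tok, b_tok = train_perceptron(tok_x, tok_y)
w_pool, b_pool = train_perceptron(pool_x, labels)
print("token-level train accuracy:", accuracy(w_tok, b_tok, tok_x, tok_y))
print("sum-pooled  train accuracy:", accuracy(w_pool, b_pool, pool_x, labels))
```

Because the synthetic label is defined on the span sum, the pooled features are linearly separable by construction and the pooled probe reaches perfect training accuracy; the token-level score depends only on how much each single token happens to correlate with its span's label.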
