Inference-Time Toxicity Mitigation in Protein Language Models

Manuel Fernández Burda; Santiago Aranguri; Iván Arcuschin Moreno; Enzo Ferrante

Abstract

Protein language models (PLMs) are becoming practical tools for de novo protein design, yet their dual-use potential raises safety concerns. We show that domain adaptation to specific taxonomic groups can elicit toxic protein generation, even when toxicity is not the training objective. To address this, we adapt Logit Diff Amplification (LDA) as an inference-time control mechanism for PLMs. LDA modifies token probabilities by amplifying the logit difference between a baseline model and a toxicity-finetuned model, requiring no retraining. Across four taxonomic groups, LDA consistently reduces the predicted toxicity rate (measured via ToxDL2) below the taxon-finetuned baseline while preserving biological plausibility. We evaluate quality using Fréchet ESM Distance and predicted foldability (pLDDT), finding that LDA maintains distributional similarity to natural proteins and structural viability, unlike activation-based steering methods, which tend to degrade sequence properties. Our results demonstrate that LDA provides a practical safety knob for protein generators that mitigates elicited toxicity while retaining generative quality.
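
The logit-level intervention described in the abstract can be illustrated with a minimal sketch. The exact formula is not reproduced on this page, so the form used here, baseline + alpha * (baseline - toxic), is an assumption about one common way to amplify a logit difference. The function names, the sign convention, and the toy three-token vocabulary are all hypothetical and are not taken from the paper.

```python
import math

def lda_logits(baseline, toxic, alpha):
    """Assumed LDA form: push the baseline logits away from the toxic model.

    baseline, toxic: per-token logits from the two models at one decoding step.
    alpha: amplification strength; alpha = 0 recovers the baseline logits.
    """
    return [b + alpha * (b - t) for b, t in zip(baseline, toxic)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits, illustrative only: the toxic finetune up-weights token 1.
baseline = [1.0, 0.5, 0.0]
toxic = [1.0, 1.5, 0.0]

p_base = softmax(baseline)
p_lda = softmax(lda_logits(baseline, toxic, alpha=1.0))
```

Under this assumed form, a token that the toxic finetune prefers more strongly than the baseline loses probability mass as alpha grows, while tokens on which the two models agree keep their baseline logits.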

Paper Structure (24 sections, 1 equation, 9 figures, 2 tables)

Introduction
Methods
Experimental Setup
Models and Finetuning.
Toxicity Scoring.
Quality Metrics.
Logit Diff Amplification
Baseline Steering Methods
Results
Finetuning Elicits Toxicity
LDA Mitigates Toxicity Across Taxa
LDA Preserves Biological Quality
Discussion & Conclusions
Appendix
Taxonomic Finetuning Elicits Toxic Behaviour
...and 9 more sections

Figures (9)

Figure 1: LDA reduces predicted toxicity across taxa. Percentage of generated sequences classified as toxic by ToxDL2 (lower is better) for four taxonomic finetunes. Baseline denotes the corresponding taxon finetune, Toxic denotes models finetuned on toxin-enriched data from within that taxon and LDA denotes the intervened models at inference time. Bars report mean ± s.e.m. across three independent generation runs under identical sampling and perplexity filtering.
Figure 1: Taxon finetuning elicits toxic generation. Toxicity rates for baseline ProGen2 versus taxon-finetuned models across four taxonomic groups. Error bars show $\pm$1 standard deviation.
Figure 2: Steering intensity unveils mitigation regimes. Toxicity rate versus amplification strength $\alpha$ for (a) Arthropoda (log scale), (b) Gastropoda, (c) Lepidosauria, and (d) Arachnida. Dashed lines indicate taxon-finetuned baseline (gray) and toxic-finetuned model (red). For all taxa, there exists an $\alpha$ range where toxicity drops below baseline.
Figure 3: LDA maintains sequence quality. $\Delta$FED (left) and $\Delta$pLDDT (right) versus $\alpha$ for LDA across taxa.
Figure 4: Linear probing reveals toxicity encoding across layers. Classification metrics (Accuracy, AUC-ROC, F1) for linear probes trained on layer-wise activations. Performance increases with depth, indicating toxicity-related information emerges gradually and becomes linearly accessible in intermediate-to-final layers. Metrics statistics are calculated over 5 random species-stratified splits.
...and 4 more figures
