Protein Counterfactuals via Diffusion-Guided Latent Optimization

Weronika Kłos; Sidney Bender; Lukas Kades

TL;DR

This paper introduces Manifold-Constrained Counterfactual Optimization for Proteins (MCCOP), a framework that computes minimal, biologically plausible sequence edits that flip a model's prediction to a desired target state.

Abstract

Deep learning models can predict protein properties with unprecedented accuracy but rarely offer mechanistic insight or actionable guidance for engineering improved variants. When a model flags an antibody as unstable, the protein engineer is left without recourse: which mutations would rescue stability while preserving function? We introduce Manifold-Constrained Counterfactual Optimization for Proteins (MCCOP), a framework that computes minimal, biologically plausible sequence edits that flip a model's prediction to a desired target state. MCCOP operates in a continuous joint sequence-structure latent space and employs a pretrained diffusion model as a manifold prior, balancing three objectives: validity (achieving the target property), proximity (minimizing mutations), and plausibility (producing foldable proteins). We evaluate MCCOP on three protein engineering tasks (GFP fluorescence rescue, thermodynamic stability enhancement, and E3 ligase activity recovery) and show that it generates sparser, more plausible counterfactuals than both discrete and continuous baselines. The recovered mutations align with known biophysical mechanisms, including chromophore packing and hydrophobic core consolidation, establishing MCCOP as a tool for both model interpretation and hypothesis-driven protein design. Our code is publicly available at github.com/weroks/mccop.
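To make the alternating scheme described in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a logistic predictor on a small latent vector, uses top-k gradient masking as a stand-in for the sparsity mechanism, and replaces the pretrained diffusion prior with a simple rescaling into an L2 ball. All function names, parameters, and numeric values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sparse_gradient_step(z, grad, step_size=0.1, k=2):
    """Ascend along only the k largest-magnitude gradient coordinates."""
    top = sorted(range(len(z)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    stepped = list(z)
    for i in top:
        stepped[i] += step_size * grad[i]
    return stepped

def toy_projection(z, radius=3.0):
    """Hypothetical placeholder for the diffusion-based manifold projection:
    rescale z into an L2 ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= radius:
        return list(z)
    return [v * radius / norm for v in z]

def toy_counterfactual(z0, w, b, target_prob=0.9, k=2, max_iters=200):
    """Alternate a sparse ascent step with projection until the toy
    predictor assigns the target class at least target_prob."""
    z = list(z0)
    for _ in range(max_iters):
        p = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)
        if p >= target_prob:
            break
        # Gradient of log sigmoid(w.z + b) with respect to z.
        grad = [(1.0 - p) * wi for wi in w]
        z = toy_projection(sparse_gradient_step(z, grad, k=k))
    return z

# Usage with made-up weights: only two latent coordinates end up edited.
w = [2.0, -0.1, 0.05, 1.5, 0.0, 0.2, -0.3, 0.1, 0.0, 0.05]
z0 = [0.0] * 10
z_cf = toy_counterfactual(z0, w, b=-2.0)
```

In this toy setting the three objectives map onto the loop as follows: the ascent step targets validity, the top-k mask keeps the edit sparse (proximity), and the projection stands in for plausibility.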

Paper Structure (47 sections, 5 equations, 7 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Protein language models and embeddings.
Diffusion and generative protein design.
Explainability in the protein domain.
Counterfactual explanations.
Methods
Problem Formulation
Latent Representation
Predictor Smoothing
Counterfactual Optimization
Objective Function
Gradient-Based Sparsity Masking
Manifold Projection
Experimental Setup
...and 32 more sections

Figures (7)

Figure 1: Overview of MCCOP. A non-fluorescent GFP variant is mapped to a continuous joint sequence–structure latent space via a pretrained autoencoder. After smoothing the classification boundary, the counterfactual embedding is optimized by alternating between (1) a sparse gradient step maximizing target class probability and (2) manifold projection using a pretrained diffusion model (DiMA). The counterfactual shown is a rediscovered sample from the held-out test set (red sticks denote the mutated residue).
Figure 2: Physicochemical plausibility across benchmarks (columns: pLDDT, GRAVY, instability index, $R_g$; rows: fluorescence, activity, stability). MCCOP (orange) closely matches the original distribution (gray); discrete baselines show broader shifts. Statistical comparisons via Kruskal-Wallis/Dunn's tests with Benjamini-Hochberg correction: MCCOP achieves significantly higher pLDDT than both baselines across all tasks (adjusted $p < 0.02$).
Figure 3: Per-residue mutation frequency for fluorescence (A) and Ube4b activity (B). MCCOP (blue) concentrates mutations in functionally relevant regions (chromophore-proximal residues for GFP, the E2-binding interface for Ube4b), while baselines distribute mutations nearly uniformly. Shaded regions: known functional motifs (Sarkisyan et al., 2016; Starita et al., 2013).
Figure 4: Structural alignments between original (gray) and counterfactual (colored) proteins for rediscovered ground-truth examples. (A) GFP: mutations near the chromophore. (B) Stability: core-facing mutations. (C) Ube4b: E2-binding interface mutations. Structures predicted by ESM3.
Figure 5: Average wall-clock time per sample across the complete test set. The genetic algorithm is the most expensive method. MCCOP and stochastic hill climbing have comparable execution times, though their computational profiles differ: discrete baselines spend more than 94% of their time re-encoding candidate sequences, while MCCOP spends 99% on diffusion-based manifold projection.
...and 2 more figures
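The Figure 2 caption reports p-values adjusted with the Benjamini-Hochberg correction. As a reminder of what that adjustment does, here is a small self-contained sketch of the standard step-up procedure. The input p-values are invented for illustration and are not results from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices sorted by ascending raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four pairwise comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a smaller raw p-value never receives a larger adjusted value than a bigger one.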
