A thermodynamic metric quantitatively predicts disordered protein partitioning and multicomponent phase behavior

Zhuang Liu, Beijia Yuan, Mihir Rao, Gautam Reddy, William M. Jacobs

TL;DR

A thermodynamic model is introduced that quantitatively predicts the behavior of arbitrary combinations of IDRs across a wide range of concentrations, with accuracy comparable to state-of-the-art simulations, and establishes a unified and interpretable framework for predicting and understanding the behavior of complex mixtures of IDRs and other sequence-dependent biomolecules.

Abstract

Intrinsically disordered regions (IDRs) of proteins mediate sequence-specific interactions underlying diverse cellular processes, including the formation of biomolecular condensates. Although IDRs strongly influence condensate compositions, quantitative frameworks that predict and explain their phase behavior in complex mixtures remain lacking. Here we introduce a thermodynamic model that quantitatively predicts the behavior of arbitrary combinations of IDRs across a wide range of concentrations, with accuracy comparable to state-of-the-art simulations. The model learns low-dimensional, context-independent representations of IDR sequences that combine to form mixture representations, producing context-dependent interactions. These representations define a thermodynamic metric space in which distances between IDRs correspond directly to differences in their thermodynamic properties. We show that the model predicts multicomponent phase diagrams in quantitative agreement with molecular simulations without being trained on free-energy or phase-coexistence data. The metric space provides geometrically intuitive predictions of IDR partitioning, multicomponent condensation, and context-dependent mutational effects, addressing several central problems in IDR biophysics within a single model. Systematic interrogation of the learned representations reveals how amino-acid composition and sequence patterning jointly determine mixture thermodynamics. Together, our results establish a unified and interpretable framework for predicting and understanding the behavior of complex mixtures of IDRs and other sequence-dependent biomolecules.

Paper Structure (14 sections, 6 equations, 5 figures)


Figures (5)

  • Figure 1: An interpretable thermodynamic model quantitatively predicts IDR interactions in multicomponent mixtures. (a) Context-independent IDR feature vectors combine in a concentration-weighted average to define a mixture representation; these $d$-dimensional vectors predict context-dependent IDR interactions via the excess chemical potential function $\mu^{\text{ex}}$. (b) IDR feature vectors reside in a thermodynamic metric space in which distances correspond to a norm of the difference between $\mu^{\mathrm{ex}}$ functions. This space is uniquely determined by the mixture ensemble. (c) We curated training and test IDRs by fragmenting the human IDRome into equal-length sequences of 20 residues. This produced an ensemble of 335,439 minimally overlapping representative fragments, which were combined into mixtures by sampling from a mixture ensemble. (d) Convergence of the MLP-model metric space with increasing dimension $d$. (e) Two-dimensional projections of feature vectors in the $d=10$ MLP-model metric space, with distances in units of $kT$. Histograms show the distribution of all fragments in the sequence ensemble. (f) EOS test performance, $\mathrm{RMSE}(P/c_{\mathrm{tot}})$, on 10,440 random binary mixtures as a function of the metric-space dimension $d$ and (g) corresponding scatter plots for three models: FINCHES [ginell2025sequence] with fitting parameters chosen by training on EOS data, a learned pairwise model with $d=5$, and the MLP model with $d=10$. (h) EOS test performance on random $n$-component mixtures for the same three models. (i) A test set of 231 free-energy density differences, $\Delta f$, between random binary mixtures was constructed by performing explicit thermodynamic integration. (j) Scatter plots showing test performance on the FE test set for the three models considered in (g). (k) FE test performance, $\mathrm{RMSE}(\Delta f)$, on random $n$-component mixtures for the same three models.
Error bars indicate the standard deviation over random mixture pairings. In (f--k), sequences that compose the random mixtures were sampled uniformly from the metric space of the $d=10$ MLP model.
  • Figure 2: The model quantitatively predicts multicomponent IDR phase diagrams. (a) Using representative sequences sampled uniformly from the metric space, we applied the $d = 10$ MLP model to predict a matrix of binary phase diagrams. (b) In this example phase diagram (for representative sequences c and i), blue points indicate concentrations at which the mixture is predicted to be homogeneous. Orange lines indicate two-phase regions; mixtures with parent concentrations within these regions are predicted to phase-separate into coexisting phases connected by the orange tie lines. (c) Predictions from (b) are compared with direct-coexistence molecular dynamics simulations at two selected parent concentrations, labeled I and II. Equilibrium simulation snapshots indicate that parent concentration I is a homogeneous mixture, whereas parent concentration II phase separates. The distributions of local concentrations observed in simulations are summarized by thresholded histograms (shaded regions) in (b). (d) Comparisons are quantified using the Wasserstein $W_1$ distance between the predicted one- or two-phase distribution in the thermodynamic limit and the distribution of local concentrations in direct-coexistence simulations at the same parent concentration. (e) The distribution of normalized $W_1$ distances was computed for a test set of 59 parent concentrations sampled from (a) for the three models considered in Fig. 1f--k. The positive control is the smallest possible $W_1$ distance for each mixture given the simulated concentration distribution. The negative control was constructed by pairing simulations with random predictions that preserve the parent concentration. Examples from the test set are shown in Fig. S4.
  • Figure 3: The thermodynamic metric space organizes IDRs according to their physical interactions. (a) Partitioning specificity refers to the tendency of an IDR to be included in or excluded from condensates with different compositions. This can be predicted directly from the geometry of the metric space, since the excess chemical potential of an IDR is the inner product of (b) the IDR feature vector, $z_i$, and (c) the gradient of the excess free-energy density, $\nabla\Psi$, evaluated at $c_{\mathrm{tot}} \bar{z}$. (d) This construction yields a classifier for inclusion, $\mu_i^{\mathrm{ex}}(c_{\mathrm{tot}} \bar{z}) < 0$, or exclusion, $\mu_i^{\mathrm{ex}}(c_{\mathrm{tot}} \bar{z}) > 0$, of the probe IDR $i$ in (b) as a function of the mixture representation $\bar{z}$ in (c). For visualization, we plot a two-dimensional slice of the metric space at constant $c_{\mathrm{tot}}$ in (c,d). (e) Condensation occurs when a concentrated droplet coexists with a dilute solution. (f) This can be predicted directly within the metric space by looking for solutions to the condensation coexistence equation $P_{\mathrm{coex}} \approx 0$, yielding a classifier for both individual IDRs and IDR mixtures. For visualization, we show a two-dimensional slice of the metric space. The colored region indicates where $P_{\mathrm{coex}} \approx 0$ has a nontrivial solution, leading to a droplet with the indicated $c_{\mathrm{tot}}$. A single IDR condenses if its feature vector lies within this region, whereas an IDR mixture undergoes mole fraction-dependent condensation if the convex hull of its components' feature vectors intersects this region. (g) These predictions are confirmed by unary and binary phase diagrams, respectively, for probe IDR 1 and the pair of probe IDRs 2 and 3 shown in (f); in the binary case, moving along the line segment in (f) corresponds to sweeping through mole fractions from 100% IDR 2 to 100% IDR 3 in (g). 
(h) The effects of point mutations on IDR interactions depend on the composition of the local environment. (i) The metric space can be exploited to quantify the effects of all single-amino-acid substitutions, $\|\Delta z\| = \|z_{\mathrm{MUT}}-z_{\mathrm{WT}}\|$. These are averaged over the IDR sequence ensemble to obtain a matrix of mean mutation distances. Because this calculation makes use of the uniform prior introduced in Fig. 2a, it reflects mutational effects across all possible IDR mixtures. Amino acids are ordered according to the leading wild-type mode of variation in the mutation-distance matrix, indicating an overall ranking of thermodynamic importance. Specializing the prior to mixtures composed only of (j) FUS fragments, which are relatively hydrophobic, or (k) NPM1 fragments, which are enriched in negative charge, rescales metric distances, revealing context dependence across distinct ensembles of mixture environments. All calculations were performed using the $d=10$ MLP model.
  • Figure 4: Systematic interrogation of learned feature vectors quantifies the compositional and sequence-dependent determinants of IDR mixture thermodynamics. (a) The explained variance and corresponding metric-space loading vector, representing the optimal linear projection of theoretical descriptor(s) $Y$ onto learned feature vectors $Z$, are shown for each of eight individual descriptors, as well as for the 12-element Mpipi compositional descriptor set [joseph2021physics] and the associated 40-element NARDINI patterning descriptor set [cohan2022uncovering]. Individual descriptors are net charge per residue (NCPR) [mao2010net], average hydrophobicity [joseph2021physics], fraction of charged residues (FCR) [mao2010net], a mean-field summary of short-ranged interactions ($\mathrm{WF}_{\mathrm{pairs}}$) [von2025prediction], sequence hydropathy decoration (SHD) [zheng2020hydropathy], average residue size [joseph2021physics], sequence charge decoration (SCD) [firman2018sequence], and the charge-patterning descriptor $\kappa$ [das2013conformations]. (b) The mutual information $I(Y,Z)$, which quantifies the overall (nonlinear) relationship between $Y$ and $Z$, is shown along with the increase in $I(Y,Z)$ as each dimension is added to $Z$. (c) Theoretical descriptor sets that best approximate the metric space, according to greedy maximization of $I(Y,Z)$. The order in which the descriptors are added to the set is indicated; the final point uses all 60 descriptors, including Mpipi composition and NARDINI descriptors. The entropy of the 335,439 IDR feature vectors, $H(Z)$, is shown for comparison (black line and gray region indicating SEM). (d) The magnitude of composition-independent effects was quantified by comparing EOS simulations of mixtures of wild-type (WT) and mutant (MUT) IDR sequences sampled according to ESM2 likelihood [lin2023evolutionary]; MUT mixtures were constructed by replacing the WT sequences in the binary-mixture test set (Fig. 1f,g).
Red points indicate the test performance on WT and MUT mixtures given by the RMSE between prediction, $\hat{P}$, and simulation, $P$. Blue points show the results of swapping the MUT and WT labels when evaluating the model. The root-mean-square difference between paired WT and MUT simulations, $\langle((P_{\mathrm{MUT}} - P_{\mathrm{WT}}) / c_{\mathrm{tot}})^2\rangle^{1/2}$, is shown for comparison (black dashed line and gray region indicating SEM). (e) A counterfactual analysis considered all unique single-swap mutants, weighted in accordance with ESM2 likelihood, around each WT sequence. The histogram shows the distribution of the spread (RMSD of feature vectors relative to the centroid in the metric space) of these mutants around each of the 100,000 WT sequences in the test set. (f) The distributions of selected theoretical descriptors for WT sequences with extreme spreads, lying in the bottom (B, blue) or top (T, red) 10% of the spreads shown in (e). The ability of each of the 8 individual theoretical descriptors to predict the label T or B of an extreme WT sequence is quantified by the mutual information $I(\mathrm{label}, Y)$. (g) The average spread conditioned on the distance of the swapped positions from the end of the sequence and the WT amino acids at the swapped positions; higher values indicate that counterfactuals satisfying these criteria result in larger perturbations on average. (h) The average spread conditioned on whether the swapped positions affect a contiguous or non-contiguous motif within the WT sequence. Results are presented for motifs consisting of triplets of charged ($+$ or $-$), hydrophobic/polar (h or p), or, as a control, all ($*$) amino acids. All calculations were performed using the $d=10$ MLP model. Error bars, if shown, indicate SEM; otherwise, errors are smaller than the symbol sizes.
  • Figure 5: Symmetry-preserving framework for predicting mixture free energies. (a) Our custom machine-learning framework consists of an encoder $\phi$, which maps each sequence to its feature vector, and a decoder $\Psi$, which predicts the excess free-energy density, $\hat{f}^{\mathrm{ex}}$, from the mixture representation. Equilibrium thermodynamic properties are then obtained from $\hat{f}^{\mathrm{ex}}$. (b) We trained the model by performing coarse-grained molecular dynamics simulations [joseph2021physics, LAMMPS] of random IDR mixtures and minimizing the loss between the predicted, $\hat{P}$, and simulated, $P$, EOS. Simulations of mixtures of held-out sequences were used to construct independent test sets.
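
The relations described in Figs. 1a, 3a--d, and 5a can be illustrated with a minimal Python sketch. It is not the authors' implementation. The decoder here is a hypothetical quadratic form (the paper's decoder is a learned MLP), the feature vectors are set by hand rather than produced by the encoder $\phi$, and the coefficients `A`, the function names, and the choice of units with $kT = 1$ are all assumptions made for illustration. The pressure expression is the standard thermodynamic identity $P = kT\,c_{\mathrm{tot}} + \sum_i c_i \mu_i^{\mathrm{ex}} - f^{\mathrm{ex}}$, which is not quoted from the captions.

```python
KT = 1.0  # energies in units of kT (assumption for this toy example)

# Hypothetical decoder coefficients; the paper's decoder is a learned MLP.
A = [-2.0, 1.0]


def mixture_vector(concs, features):
    """Return c_tot * z_bar = sum_i c_i z_i, the argument passed to the decoder."""
    dim = len(features[0])
    return [sum(c * z[k] for c, z in zip(concs, features)) for k in range(dim)]


def psi(x):
    """Toy decoder: excess free-energy density f_ex = 0.5 * sum_k A_k x_k^2."""
    return 0.5 * sum(a * xk * xk for a, xk in zip(A, x))


def grad_psi(x):
    """Analytic gradient of the toy decoder with respect to its argument."""
    return [a * xk for a, xk in zip(A, x)]


def excess_chemical_potentials(concs, features):
    """mu_i^ex = z_i . grad Psi(c_tot * z_bar), one value per component."""
    g = grad_psi(mixture_vector(concs, features))
    return [sum(zk * gk for zk, gk in zip(z, g)) for z in features]


def pressure(concs, features):
    """P = kT c_tot + sum_i c_i mu_i^ex - f_ex (standard thermodynamic identity)."""
    mu_ex = excess_chemical_potentials(concs, features)
    f_ex = psi(mixture_vector(concs, features))
    return KT * sum(concs) + sum(c * m for c, m in zip(concs, mu_ex)) - f_ex


def partition_label(mu_ex):
    """Fig. 3d classifier: inclusion if mu^ex < 0, otherwise exclusion.

    The boundary case mu^ex == 0 is labeled "excluded" here by convention.
    """
    return "included" if mu_ex < 0 else "excluded"


# Two hand-set feature vectors in a d = 2 toy metric space.
features = [[1.0, 0.0], [0.0, 1.0]]
concs = [0.1, 0.3]
mu = excess_chemical_potentials(concs, features)
labels = [partition_label(m) for m in mu]
```

Because the decoder depends on the components only through $\sum_i c_i z_i$, the chain rule gives $\mu_i^{\mathrm{ex}} = \partial f^{\mathrm{ex}} / \partial c_i = z_i \cdot \nabla\Psi$, which is the inner-product construction in Fig. 3a--c. Replacing `psi` and `grad_psi` with a trained network and its gradient would leave the remaining functions unchanged.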