Aligning Transformers with Continuous Feedback via Energy Rank Alignment

Shriram Chennakesavalu; Frank Hu; Sebastian Ibarraran; Grant M. Rotskoff


TL;DR

Energy Rank Alignment (ERA) steers autoregressive molecular and protein language models toward outputs with externally specified properties by building an explicit reward function $U$, treated as an energy, into a gradient-based objective. The resulting optimal policy is a Gibbs-Boltzmann-like distribution $\pi_\star(\mathbf{y}|\mathbf{x}) \propto \exp\left[-\frac{\beta}{1+\gamma} U(\mathbf{x},\mathbf{y}) + \frac{\gamma}{1+\gamma}\log \pi_{\rm ref}(\mathbf{y}|\mathbf{x})\right]$, where the inverse temperature $\beta$ and the regularization strength $\gamma$ control the trade-off between exploitation and exploration. ERA provides a direct, differentiable objective that is related to PPO and DPO but retains an explicit reward signal, which allows stable optimization with finite entropy and adjustable regularization. The authors show that it robustly aligns molecular transformers and a protein language model, producing diverse, high-scoring samples across chemistry and protein design tasks. The approach achieves competitive or superior sample efficiency and diversity compared to baselines, which suggests practical potential for multi-property optimization and guided directed evolution. Stated limitations include dependence on a tractable reward model and the lack of explicit optimization for synthesizability.
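To make the role of $\beta$ and $\gamma$ concrete, the sketch below evaluates the optimal policy $\pi_\star$ from the formula above on a small, finite set of candidates. It is an illustration only, not the authors' implementation: the function name `optimal_policy`, the three toy energies, and the toy reference probabilities are all assumptions made for this example.

```python
import math

def optimal_policy(energies, ref_probs, beta, gamma):
    """Normalize exp[-beta/(1+gamma) * U + gamma/(1+gamma) * log pi_ref]
    over a finite candidate set (hypothetical helper, for illustration).

    All reference probabilities must be strictly positive.
    """
    logits = [
        -beta / (1.0 + gamma) * u + gamma / (1.0 + gamma) * math.log(p)
        for u, p in zip(energies, ref_probs)
    ]
    shift = max(logits)  # subtract the max before exponentiating, for stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Toy problem (assumed values): three candidates y for a single prompt x.
energies = [0.0, 1.0, 2.0]   # U(x, y); lower energy means higher reward
ref_probs = [0.2, 0.3, 0.5]  # pi_ref(y | x)

# gamma = 0: no regularization, so the policy is the Boltzmann distribution exp(-beta * U) / Z.
boltzmann = optimal_policy(energies, ref_probs, beta=1.0, gamma=0.0)

# Very large gamma: the energy term is suppressed and the policy approaches pi_ref.
near_ref = optimal_policy(energies, ref_probs, beta=1.0, gamma=1e6)
```

With $\gamma = 0$ the lowest-energy candidate receives the most probability mass even though the reference policy favors the highest-energy one. With a very large $\gamma$ the result is numerically close to `ref_probs`.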

Abstract

Searching through chemical space is an exceptionally challenging problem because the number of possible molecules grows combinatorially with the number of atoms. Large, autoregressive models trained on databases of chemical compounds have yielded powerful generators, but we still lack robust strategies for generating molecules with desired properties. This molecular search problem closely resembles the "alignment" problem for large language models, though for many chemical tasks we have a specific and easily evaluable reward function. Here, we introduce an algorithm called energy rank alignment (ERA) that leverages an explicit reward function to produce a gradient-based objective that we use to optimize autoregressive policies. We show theoretically that this algorithm is closely related to proximal policy optimization (PPO) and direct preference optimization (DPO), but has a minimizer that converges to an ideal Gibbs-Boltzmann distribution with the reward playing the role of an energy function. Furthermore, this algorithm is highly scalable, does not require reinforcement learning, and performs well relative to DPO when the number of preference observations per pairing is small. We deploy this approach to align molecular transformers and protein language models to generate molecules and protein sequences, respectively, with externally specified properties and find that it does so robustly, searching through diverse parts of chemical space.


Paper Structure (35 sections, 4 theorems, 37 equations, 13 figures, 10 tables)

Introduction
Our contribution:
Related Work
Energy rank alignment
Loss functions for $\pi_{\boldsymbol{\theta}}$:
Theoretical Analysis
Experiments
Generating small molecules with desired properties
Unprompted molecular alignment on RDKit oracles
Prompted molecular alignment on RDKit oracles
Unprompted molecular alignment on protein-ligand docking oracles
Directed evolution of proteins with ERA
Conclusions and Limitations
Limitations:
Code Availability
...and 20 more sections

Key Result

Lemma 3.1

If $\pi$ is conditionally equivalent to $\pi'$, then $\pi_g' (\cdot|\boldsymbol{x}) \propto \pi'(\cdot |\boldsymbol{x}) e^{g(\boldsymbol{x})}$ is conditionally equivalent to $\pi$ for all functions $g:\mathcal{X}\to \mathbb{R}$ such that $\sup_{\boldsymbol{x}\in\mathcal{X}} |e^{g(\boldsymbol{x})}| <

Figures (13)

Figure 1: Energy rank alignment (ERA) enables targeting low-energy, high-reward regions with controllable fluctuations. Optimal policy approaches Boltzmann distribution with low regularization ($\gamma \to 0$) and reference policy with high regularization ($\gamma \to \infty$) (left). Aligned models can be used to sample molecules with desired chemical properties (right).
Figure 2: Unprompted molecular generator alignment. Distributions of different chemical properties for molecules sampled from aligned and unaligned policies. The center of the harmonic potential, $\mu$, is varied for MR ($\beta=1.0$), Ring Count ($\beta=1.0$), and LogP ($\beta=10.0$), while $\beta$ is varied for QED. All experiments were run with no regularization to the reference policy ($\gamma=0$).
Figure 3: Unprompted multi-property molecular generator alignment. 2D histograms of LogP versus QED for different combinations of property-specific $\beta$ illustrating a clear trade-off when performing multi-property alignment. Relative increases in $\beta$ for a given property target higher values for that property. All experiments were run with no regularization to the reference policy ($\gamma=0$).
Figure 4: Visualization of three generated ligands docked against the GSK3$\beta$ kinase target (top) and three generated ligands docked against the JNK3 kinase target (bottom). In each case, these were the three molecules with the best (most negative) Glide Standard Precision docking scores and oracle scores of 1.0.
Figure 5: Alignment of ESM3-1.4B with $\beta = 0, 0.1, 1.0, 10.0$ and $\gamma = 0.001$ on the task of maximizing EVmutation score. Positions 182, 183, 184, and 186 of the TrpB parent sequence were masked and ESM3-1.4B predicted amino acids at those sites. The distribution of the EVmutation scores for generated sequences shifts significantly as $\beta$ is increased.
...and 8 more figures

Theorems & Definitions (6)

Definition 3.1
Lemma 3.1
Proposition 3.2
Definition A.1
Lemma A.1
Proposition A.2
