Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning

Marco Jiralerspong, Esther Derman, Danilo Vucetic, Nikolay Malkin, Bilun Sun, Tianyu Zhang, Pierre-Luc Bacon, Gauthier Gidel

TL;DR

This work tackles the challenge of discovering high-quality, diverse candidates in discrete compositional processes by revisiting sampling-by-reward in the GFN/soft-RL framework. It introduces General Mellowmax (GM), a regularizer that interpolates between entropy-based and KL-based objectives, and Trajectory General Mellowmax (TGM), a trajectory-level variant that enforces a novel consistency constraint. The authors establish a robust RL interpretation via Fenchel-robust MDPs, linking regularization to reward uncertainty and deriving state-rectangular uncertainty sets that more faithfully capture compositional uncertainty. Empirically, TGM finds higher-quality and more diverse modes and shows robustness across synthetic and real-world biological design tasks, outperforming GFNs, SAC, and PPO in several domains. The results suggest that the GM/TGM framework provides a scalable, principled approach for navigating exponentially large search spaces in scientific discovery problems.

Abstract

A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Secondly, we offer a novel, robust RL perspective of this filtering process. The regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy's evaluation). Our analysis leads us to a novel, easy-to-use algorithm we name trajectory general mellowmax (TGM): we show it identifies higher quality, diverse candidates than baselines in both synthetic and real-world tasks. Code: https://github.com/marcojira/tgm.
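The claim that reward-proportional samplers become overly diverse in large search spaces can be illustrated with simple arithmetic. The sketch below is not taken from the paper: it assumes a toy space with one optimal sequence and many mediocre ones, and a sampler with $p(x) \propto R(x)^{\beta}$. The function name `mass_on_optimum` and all numeric values (alphabet size 20, rewards 100 and 1) are hypothetical choices made for illustration.

```python
def mass_on_optimum(length, alphabet=20, r_opt=100.0, r_other=1.0, beta=1.0):
    """Probability of drawing the single best sequence when p(x) is
    proportional to R(x) ** beta (toy setting, all values hypothetical)."""
    n_other = alphabet ** length - 1          # every other sequence of this length
    w_opt = r_opt ** beta                     # unnormalized weight of the optimum
    w_other = n_other * r_other ** beta       # total weight of all other sequences
    return w_opt / (w_opt + w_other)


# The optimum's share of probability mass shrinks as the space grows,
# even though its reward is 100 times larger than any other sequence's.
for length in (2, 4, 8):
    print(length, mass_on_optimum(length))

# Raising beta sharpens the distribution, but the mass on the optimum
# still depends on how many competing sequences exist.
print(mass_on_optimum(8, beta=4.0))
```

In this toy setting the number of competing sequences grows exponentially with length while the optimum's weight stays fixed, which is the dilution effect the abstract describes.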

Paper Structure

This paper contains 59 sections, 17 theorems, 73 equations, 7 figures, 4 tables.

Key Result

Theorem 3.1

Assume that the reward function $r$ is uncertain and satisfies $r_s\in {\mathcal{R}}_s:= r_0(s,\cdot) + \tilde{{\mathcal{R}}}_s$, where $\tilde{{\mathcal{R}}}_s\subseteq [-R,R]^{{\mathcal{A}}}$ is closed and convex for all $s\in{\mathcal{S}}$. Then, for any $\pi\in\Pi$, the robust value function $v

Figures (7)

  • Figure 1: (Left) Illustrated issue with sampling proportional to reward. The much larger reward of the optimal sequence (in green) is drowned out by all the rewards in the combinatorial explosion of longer subsequences. (Right) A DCP for a protein sequence generation task. (A) Starting from an empty string, amino acids are added sequentially until termination. (B) Then, the full sequence is evaluated by a proxy reward function $\Phi(x)$ whose value is given as reward for the termination action. (C) Over the course of protein generation, uncertainty accumulates through the $\delta_i$. (D) The true reward depends on $\Phi(x)$ and the accumulated uncertainty.
  • Figure 2: Distribution of exponential rewards from 1 million uniformly randomly drawn samples for various tasks compared to the distribution sampled by TGM at the end of training.
  • Figure 3: (Left) Regardless of $\omega$, the uncertainty set never contains $\Phi(x)$. As a result, the soft Bellman/GFN operator is only robust to increases in reward. (Middle) For the soft mellowmax operator, for different values of $\alpha$ with $d_s[1]>d_s[2]$, the uncertainty set contains $\Phi(x)$. Thus, the operator is robust to decreases in reward of one object (but not both at the same time). When $\alpha=0$ (mellowmax), there is a symmetry in this tradeoff, while increasing $\alpha$ skews it such that the object with higher $d_s$ only admits a small decrease in reward. (Right) The uncertainty sets for GM interpolate between the two effects. While the uncertainty set for $0 < \mathtt{q} < 1$ does not contain $\Phi(x)$, it does contain points corresponding to decreases in reward.
  • Figure 4: (Left) Comparison of the optimal sampling distribution of GFN and variants of TGM for TF-Bind-8 rewards. For the same $\beta = 4$, TGM concentrates significantly more mass on the upper quantiles of the reward distribution. (Middle) Number of modes found by TGM and GFN. The increased peakiness of the TGM sampling does not harm its ability to find different modes. (Right) Average (over modes) of the distance of the closest sample found for each mode. On average, increasing $\mathtt{q}$ allows TGM to find closer samples to the true modes.
  • Figure 5: Spread of final average mode rewards for various algorithms from a grid sweep over learning rates, $\beta$ and $\omega$. TGM on average performs better in AMP and GFP and similarly in UTR.
  • ...and 2 more figures
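
The Figure 3 caption contrasts the soft Bellman/GFN operator with mellowmax. As a rough aid, the sketch below implements the two textbook operators: the log-sum-exp soft maximum and the standard mellowmax, which replaces the sum with a mean. It does not implement the paper's GM operator, whose exact form is not given in this summary. The function names are hypothetical, and using a single temperature argument for both operators is an assumption made for comparison.

```python
import math


def soft_max(values, beta):
    """Log-sum-exp soft maximum: (1/beta) * log(sum(exp(beta * v)))."""
    m = max(values)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(beta * (v - m)) for v in values)) / beta


def mellowmax(values, omega):
    """Standard mellowmax: (1/omega) * log(mean(exp(omega * v)))."""
    m = max(values)
    mean_exp = sum(math.exp(omega * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / omega


values = [1.0, 0.0, 0.0, 0.0]  # hypothetical action values
temperature = 4.0              # hypothetical shared temperature

lse = soft_max(values, temperature)
mm = mellowmax(values, temperature)

# At equal temperature the two differ by exactly log(n) / temperature,
# so the soft maximum grows with the number of actions while mellowmax
# stays between the mean and the max of the values.
print(lse, mm, lse - mm, math.log(len(values)) / temperature)
```

The `log(n)` offset is one way to see why a sum-based operator can be inflated by a large number of low-value options, whereas the mean-based mellowmax never exceeds the largest value.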

Theorems & Definitions (30)

  • Theorem 3.1: derman2021twice
  • Proposition 4.1
  • Theorem 4.1: Trajectory GM
  • Theorem 5.1: Fenchel-Robust MDP
  • Definition A.1: Convex conjugate
  • Definition A.2: Infimal convolution
  • Proposition A.1: Shannon conjugate
  • Proposition A.2: KL conjugate
  • Proposition A.3: Shannon-KL conjugate
  • proof
  • ...and 20 more