Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning

Marco Jiralerspong; Esther Derman; Danilo Vucetic; Nikolay Malkin; Bilun Sun; Tianyu Zhang; Pierre-Luc Bacon; Gauthier Gidel

TL;DR

This work tackles the challenge of discovering high-quality, diverse candidates in discrete compositional processes by revisiting sampling-by-reward in the GFlowNet (GFN) and soft-RL framework. It introduces General Mellowmax (GM), a regularizer that interpolates between entropy-based and KL-based objectives, and Trajectory General Mellowmax (TGM), a trajectory-level variant that enforces a novel consistency constraint. The authors establish a robust RL interpretation via Fenchel-robust MDPs, linking regularization to reward uncertainty and deriving state-rectangular uncertainty sets that more faithfully capture compositional uncertainty. Empirically, TGM achieves higher-quality, more diverse mode discovery and greater robustness across synthetic and real-world biological design tasks, outperforming GFNs, SAC, and PPO in several domains. The results suggest that the GM/TGM framework provides a scalable, principled approach for navigating exponentially large search spaces in scientific discovery problems.

Abstract

A major bottleneck in scientific discovery is narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods use reinforcement learning (RL) guided by a proxy reward function to perform this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that the proxy function rates highly. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Second, we offer a novel robust RL perspective on this filtering process: the regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy's evaluation). Our analysis leads to a novel, easy-to-use algorithm that we name trajectory general mellowmax (TGM); we show that it identifies higher-quality, diverse candidates than baselines in both synthetic and real-world tasks. Code: https://github.com/marcojira/tgm.
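For readers unfamiliar with the regularized operators the abstract refers to, the following minimal sketch contrasts two standard soft maxima: the entropy-regularized log-sum-exp and the classical mellowmax, which corresponds to KL regularization toward a uniform prior. This is not the paper's General Mellowmax operator, whose definition does not appear in this section; the function names and the inverse-temperature parameter `omega` are illustrative assumptions.

```python
import math


def log_sum_exp(values, omega):
    """Entropy-regularized soft maximum: (1/omega) * log(sum_i exp(omega * v_i)).

    The maximum is subtracted before exponentiating for numerical stability.
    """
    m = max(values)
    return m + math.log(sum(math.exp(omega * (v - m)) for v in values)) / omega


def mellowmax(values, omega):
    """Classical mellowmax: (1/omega) * log(mean_i exp(omega * v_i)).

    Equivalent to log_sum_exp(values, omega) - log(len(values)) / omega.
    """
    m = max(values)
    mean_exp = sum(math.exp(omega * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / omega


# Illustrative values only (hypothetical action values, not from the paper).
values = [1.0, 2.0, 3.0]
lse = log_sum_exp(values, 1.0)
mm = mellowmax(values, 1.0)
```

The two operators differ by exactly `log(n) / omega`, where `n` is the number of values: log-sum-exp always exceeds the true maximum, whereas mellowmax stays between the mean and the maximum.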
