Plug-and-Play Controllable Generation for Discrete Masked Models

Wei Guo, Yuchen Zhu, Molei Tao, Yongxin Chen

TL;DR

This work addresses controllable generation for discrete masked models by introducing a plug-and-play sampler that relies on a mean-field approximation and iterative masking/remasking. The method samples from a target distribution $q(x) \propto r(x) p(x)$ without retraining the base model, using $K$ Monte Carlo samples and a remasking schedule to manage dependencies. The approach is demonstrated on constrained sequence sampling and protein design, showing its ability to satisfy non-differentiable rewards and improve target metrics (e.g., hydropathy, alpha-helix content) while preserving stability. Its practical impact lies in enabling flexible, training-free control of discrete generative models across domains like biology and chemistry, with efficient querying of the underlying masked models.
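
The sampling loop summarized above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the authors' implementation: `toy_marginals` is a hypothetical stand-in for a pretrained masked model (it returns uniform marginals), `MASK`, the linear remasking schedule, and the uniform fallback when every weight is zero are all choices made for this example, and `even_sum` is a toy non-differentiable reward.

```python
import numpy as np

MASK = -1  # hypothetical marker for a masked position


def toy_marginals(x, N):
    """Stand-in for a pretrained masked model: uniform marginals everywhere."""
    return np.full((len(x), N), 1.0 / N)


def controlled_sample(reward, D, N, K, n_rounds, rng):
    x = np.full(D, MASK)
    for t in range(n_rounds):
        probs = toy_marginals(x, N)
        # Mean-field proposal: fill each masked position independently, K times.
        cands = np.tile(x, (K, 1))
        for d in np.flatnonzero(x == MASK):
            cands[:, d] = rng.choice(N, size=K, p=probs[d])
        # Importance weights proportional to the reward of each completion.
        w = np.array([reward(c) for c in cands], dtype=float)
        w = w / w.sum() if w.sum() > 0 else np.full(K, 1.0 / K)
        x = cands[rng.choice(K, p=w)].copy()
        # Remask a shrinking number of positions; none after the last round.
        n_remask = (D * (n_rounds - 1 - t)) // n_rounds
        if n_remask > 0:
            x[rng.choice(D, size=n_remask, replace=False)] = MASK
    return x


def even_sum(seq):
    """Toy constraint reward: 1 if the token sum is even, else 0."""
    return float(seq.sum() % 2 == 0)


rng = np.random.default_rng(0)
sample = controlled_sample(even_sum, D=8, N=4, K=64, n_rounds=4, rng=rng)
print(sample, even_sum(sample))
```

The reward is only ever evaluated on fully filled candidates, so it needs no gradients, and the base model is queried once per round regardless of $K$.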

Abstract

This article makes discrete masked models for the generative modeling of discrete data controllable. The goal is to generate samples of a discrete random variable that adheres to a posterior distribution, satisfies specific constraints, or optimizes a reward function. This methodological development enables broad applications across downstream tasks such as class-specific image generation and protein design. Existing approaches for controllable generation of masked models typically rely on task-specific fine-tuning or additional modifications, which can be inefficient and resource-intensive. To overcome these limitations, we propose a novel plug-and-play framework based on importance sampling that bypasses the need for training a conditional score. Our framework is agnostic to the choice of control criteria, requires no gradient information, and is well-suited for tasks such as posterior sampling, Bayesian inverse problems, and constrained generation. We demonstrate the effectiveness of our approach through extensive experiments, showcasing its versatility across multiple domains, including protein design.

Paper Structure

This paper contains 19 sections, 1 theorem, 12 equations, 6 figures, 2 tables, 2 algorithms.

Key Result

Lemma 1

Suppose two probability masses or densities $p,q$ are related through $q(x)=\frac{1}{Z}r(x)p(x)$. Then, with $x_1,\dots,x_K$ i.i.d. samples from $p$, one can approximate $q$ with the following weighted empirical distribution:
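
For orientation, the standard self-normalized importance sampling estimate that fits this setup can be written as follows. This is the textbook form, assumed here rather than quoted from the paper, with $\delta_{x_k}$ denoting a point mass at $x_k$:

```latex
q(x) \;\approx\; \sum_{k=1}^{K} \frac{r(x_k)}{\sum_{j=1}^{K} r(x_j)} \, \delta_{x_k}(x),
\qquad x_1,\dots,x_K \overset{\text{i.i.d.}}{\sim} p .
```

In this form the normalizing constant $Z$ never has to be computed, since $Z = \mathbb{E}_{p}[r(x)] \approx \frac{1}{K}\sum_{j=1}^{K} r(x_j)$ is absorbed into the normalized weights.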

Figures (6)

  • Figure 1: A demonstration of the sampling algorithm (`alg:sample`) with vocabulary size $N=4$, sequence length $D=8$, and number of Monte Carlo samples $K=6$.
  • Figure 2: Result for sampling equality-constrained sequences.
  • Figure 3: Structures of protein sequences predicted by ESM3. The upper row shows sequences generated without control, and the lower row shows sequences generated with control. The sequences are randomly chosen.
  • Figure 4: Visualization of the generated sequence from protein inpainting.
  • Figure 5: Influence of $w_1$ and $A_1$ on the helix percentage when sampling alpha-helix-rich proteins.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Lemma 1: Importance Sampling
  • proof