Bootstrapped Training of Score-Conditioned Generator for Offline Design of Biological Sequences

Minsu Kim, Federico Berto, Sungsoo Ahn, Jinkyoo Park

TL;DR

Offline biological sequence design seeks high-scoring designs using only a fixed offline dataset. BootGen improves this setting by training multiple score-conditioned generators with rank-based weighting and bootstrapped data labeled by a proxy, then aggregating across models with filtering to ensure reliability and diversity. Across GFP, UTR, TFBind, and RNA tasks, BootGen outperforms competitive baselines and maintains performance under limited evaluation budgets. The approach offers practical benefits for protein, DNA, and RNA design by enabling efficient exploration of high-quality designs without repeated lab evaluations.

Abstract

We study the problem of optimizing biological sequences, e.g., proteins, DNA, and RNA, to maximize a black-box score function that is only evaluated in an offline dataset. We propose a novel solution, the bootstrapped training of score-conditioned generator (BootGen) algorithm. Our algorithm repeats a two-stage process. In the first stage, our algorithm trains the biological sequence generator with rank-based weights to enhance the accuracy of sequence generation based on high scores. The subsequent stage involves bootstrapping, which augments the training dataset with self-generated data labeled by a proxy score function. Our key idea is to align the score-based generation with a proxy score function, which distills the knowledge of the proxy score function to the generator. After training, we aggregate samples from multiple bootstrapped generators and proxies to produce diverse designs. Extensive experiments show that our method outperforms competitive baselines on biological sequence design tasks. We provide reproducible source code: https://github.com/kaist-silab/bootgen.
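
The two-stage process described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weighting formula (weight proportional to `1 / (k * N + rank)`), the parameter `k`, the choice to condition on the best score in the dataset, and the helpers `toy_generate` and `toy_proxy` are all assumptions made for illustration. The abstract itself only states that the weights are rank-based and that self-generated samples are labeled by a proxy.

```python
import random


def rank_based_weights(scores, k=0.01):
    """Assumed form of rank-based weighting: w_i ~ 1 / (k * N + rank_i).

    rank 0 is the highest score; k > 0 controls how sharply the
    weights concentrate on top-ranked samples.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    raw = [0.0] * n
    for rank, i in enumerate(order):
        raw[i] = 1.0 / (k * n + rank)
    total = sum(raw)
    return [w / total for w in raw]  # normalized to sum to 1


def bootstrap_round(dataset, generate, proxy, num_samples=8, top_k=2):
    """One bootstrapping step: sample candidates from the generator,
    label them with the proxy, and append the top_k to the dataset."""
    target = max(score for _, score in dataset)  # assumed conditioning score
    candidates = [generate(target) for _ in range(num_samples)]
    labeled = [(seq, proxy(seq)) for seq in candidates]
    labeled.sort(key=lambda pair: pair[1], reverse=True)
    return dataset + labeled[:top_k]


# Hypothetical stand-ins for a trained generator and proxy model.
rng = random.Random(0)


def toy_generate(target_score, length=8):
    # Emits "G" with probability target_score, otherwise a random other base.
    return "".join(
        "G" if rng.random() < target_score else rng.choice("ACT")
        for _ in range(length)
    )


def toy_proxy(seq):
    # Toy score: fraction of "G" characters in the sequence.
    return seq.count("G") / len(seq)


dataset = [("ACGTACGT", 0.25), ("GGGTACGT", 0.5), ("AAAAAAAA", 0.0)]
weights = rank_based_weights([score for _, score in dataset])
augmented = bootstrap_round(dataset, toy_generate, toy_proxy)
```

In a full training loop, `weights` would scale each sample's contribution to the generator's training loss, and the augmented dataset would feed the next round of weighted training.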

Paper Structure

This paper contains 36 sections, 3 equations, 7 figures, 8 tables, 2 algorithms.

Figures (7)

  • Figure 2.1: Illustration of the bootstrapped training process for learning score-conditioned generator.
  • Figure 4.1: Evaluation-performance graph comparing our method with representative offline biological design baselines. The number of evaluations $K \in [1, 128]$ is the number of candidate designs evaluated by the oracle score function. The mean and standard-deviation error bars over 8 independent runs are reported. Our method outperforms the other baselines on every task for almost all $K$.
  • Figure 4.2: Multi-objective comparison of diversity and novelty against the average score for the UTR task. Each data point from the 8 independent runs is shown.
  • Figure : Bootstrapped Training of Score-conditioned Generators
  • Figure : Aggregation Strategy for Sample Generation
  • ...and 2 more figures