Curriculum-Augmented GFlowNets For mRNA Sequence Generation

Aya Laajil, Abduragim Shtanchaev, Sajan Muhammad, Eric Moulines, Salem Lahlou

TL;DR

The paper tackles de novo mRNA sequence design under multiple objectives with long sequences and sparse rewards. It introduces Curriculum-Augmented GFlowNets (CAGFN), which integrates an adaptive length-based curriculum with a reward-conditioned, multi-objective GFlowNet, facilitated by a CodonDesignEnv for generating synonymous mRNA sequences encoding a target protein. Key contributions include a TSCL-driven curriculum that prioritizes tasks by learning progress, a conditional GFlowNet that models a family of Pareto-optimal designs, and empirical evidence showing improved Pareto front coverage, faster convergence, and robustness to out-of-distribution sequences, alongside analysis of SubTB versus TB losses for long sequences. The results demonstrate substantial improvements in design quality, diversity, and training efficiency, with practical implications for therapeutic mRNA design and broader long-horizon, multi-objective sequence generation tasks.

Abstract

Designing mRNA sequences is a major challenge in developing next-generation therapeutics, since it involves exploring a vast space of possible nucleotide combinations while optimizing sequence properties like stability, translation efficiency, and protein expression. While Generative Flow Networks are promising for this task, their training is hindered by sparse, long-horizon rewards and multi-objective trade-offs. We propose Curriculum-Augmented GFlowNets (CAGFN), which integrate curriculum learning with multi-objective GFlowNets to generate de novo mRNA sequences. CAGFN uses a length-based curriculum that progressively adapts the maximum sequence length, guiding exploration from easier to harder subproblems. We also provide a new mRNA design environment for GFlowNets which, given a target protein sequence and a combination of biological objectives, allows for the training of models that generate plausible mRNA candidates. This provides a biologically motivated setting for applying and advancing GFlowNets in therapeutic sequence design. On different mRNA design tasks, CAGFN improves Pareto performance and biological plausibility, while maintaining diversity. Moreover, CAGFN reaches higher-quality solutions faster than a GFlowNet trained with random sequence sampling (no curriculum), and enables generalization to out-of-distribution sequences.
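
To make the idea of a length-based curriculum driven by learning progress more concrete, here is a minimal sketch of a TSCL-style teacher. It is an illustration under assumptions, not the authors' implementation: the class name `LengthCurriculum`, the smoothing factor `alpha`, the exploration rate `epsilon`, and the use of a scalar training score per length are all hypothetical choices made for this example.

```python
import random


class LengthCurriculum:
    """Toy TSCL-style teacher over maximum-sequence-length tasks.

    Hypothetical sketch: names and hyperparameters are assumptions,
    not taken from the paper.
    """

    def __init__(self, lengths, alpha=0.3, epsilon=0.1, seed=0):
        self.lengths = list(lengths)
        self.alpha = alpha        # smoothing factor for learning progress
        self.epsilon = epsilon    # probability of picking a random task
        self.rng = random.Random(seed)
        self.last_score = {n: None for n in self.lengths}
        self.progress = {n: 0.0 for n in self.lengths}

    def update(self, length, score):
        """Record a new score and update the smoothed learning progress."""
        prev = self.last_score[length]
        if prev is not None:
            delta = score - prev
            self.progress[length] = (
                (1 - self.alpha) * self.progress[length] + self.alpha * delta
            )
        self.last_score[length] = score

    def sample(self):
        """Pick the next maximum length to train on."""
        # Visit every task once before comparing learning progress.
        untried = [n for n in self.lengths if self.last_score[n] is None]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.lengths)
        # Prefer the task whose score is changing fastest (in absolute value).
        return max(self.lengths, key=lambda n: abs(self.progress[n]))
```

In a training loop, one would call `sample()` to choose the maximum length for the next batch, train the GFlowNet on it, and feed a summary score (for example, mean reward) back through `update()`. Using the absolute value of progress means a task whose score is dropping is also revisited, which is one common way to counter forgetting.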

Paper Structure

This paper contains 48 sections, 11 equations, 11 figures, 5 tables, 2 algorithms.

Figures (11)

  • Figure 1: Directed acyclic graph (DAG) representation of the mRNA sequence design process. Each node corresponds to a codon selection step, and edges represent possible transitions, illustrating the sequential construction of mRNA sequences. This DAG (tree in this case) forms the basis for the GFlowNet to explore diverse and high-quality sequences.
  • Figure 2: Unconditional mRNA sequence generation with GFlowNets under fixed objective weights $[0.3, 0.3, 0.4]$. (a) Pareto front. (b) Distribution of reward metrics: The spread demonstrates that the model explores diverse regions of the design space rather than collapsing to a single mode. Vertical lines represent the natural sequence scores.
  • Figure 3: Conditional vs. unconditional mRNA generation results. Metrics distribution across mRNA sequences of a small protein of interest ($\sim$35 AA). More details in Figure \ref{fig:bars} of Appendix \ref{app:additional-plots}.
  • Figure 4: Training differences.
  • Figure 5: Averaged metrics across 100 generated sequences for a small protein task.
  • ...and 6 more figures
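
The captions of Figures 1 and 2 describe codon-by-codon construction of a sequence and scoring under fixed objective weights $[0.3, 0.3, 0.4]$. The sketch below illustrates both ideas on a toy scale. It is not the paper's CodonDesignEnv: the function names are hypothetical, the codon table covers only six amino acids, and the three scoring functions (`gc_content`, `gc3_content`, `u_avoidance`) are placeholder stand-ins, since the excerpt does not specify the actual objectives.

```python
import random

# Small subset of the standard genetic code (RNA alphabet), for illustration.
SYNONYMOUS = {
    "M": ["AUG"],
    "W": ["UGG"],
    "K": ["AAA", "AAG"],
    "F": ["UUU", "UUC"],
    "D": ["GAU", "GAC"],
    "E": ["GAA", "GAG"],
}


def actions(protein, state):
    """Codons allowed at the next position; empty once the sequence is complete."""
    if len(state) == len(protein):
        return []
    return SYNONYMOUS[protein[len(state)]]


def gc_content(codons):
    """Placeholder objective: fraction of G/C nucleotides."""
    seq = "".join(codons)
    return sum(base in "GC" for base in seq) / len(seq)


def gc3_content(codons):
    """Placeholder objective: fraction of codons ending in G or C."""
    return sum(codon[2] in "GC" for codon in codons) / len(codons)


def u_avoidance(codons):
    """Placeholder objective: one minus the fraction of U nucleotides."""
    seq = "".join(codons)
    return 1.0 - seq.count("U") / len(seq)


def reward(codons, weights=(0.3, 0.3, 0.4)):
    """Weighted sum of the placeholder objectives (weights from Figure 2)."""
    scores = (gc_content(codons), gc3_content(codons), u_avoidance(codons))
    return sum(w * s for w, s in zip(weights, scores))


def rollout(protein, rng):
    """Build one synonymous sequence by uniformly sampling a codon per step."""
    state = []
    while True:
        allowed = actions(protein, state)
        if not allowed:
            return state
        state.append(rng.choice(allowed))


if __name__ == "__main__":
    codons = rollout("MKFDEW", random.Random(0))
    print("".join(codons), round(reward(codons), 3))
```

Because each partial sequence is reached by exactly one series of codon choices, the state graph here is a tree, which matches the Figure 1 caption. A trained GFlowNet would replace the uniform `rng.choice` with a learned forward policy that samples complete sequences with probability proportional to their reward.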