Curriculum-Augmented GFlowNets For mRNA Sequence Generation
Aya Laajil, Abduragim Shtanchaev, Sajan Muhammad, Eric Moulines, Salem Lahlou
TL;DR
The paper tackles de novo mRNA sequence design, a multi-objective problem with long sequences and sparse rewards. It introduces Curriculum-Augmented GFlowNets (CAGFN), which combine an adaptive length-based curriculum with a reward-conditioned, multi-objective GFlowNet. Training takes place in CodonDesignEnv, an environment for generating synonymous mRNA sequences that encode a target protein. Key contributions are a TSCL-driven curriculum that prioritizes tasks by learning progress, a conditional GFlowNet that models a family of Pareto-optimal designs, and empirical evidence of improved Pareto-front coverage, faster convergence, and robustness to out-of-distribution sequences, alongside an analysis of SubTB versus TB losses for long sequences. The results show gains in design quality and training efficiency while maintaining diversity, with practical implications for therapeutic mRNA design and for broader long-horizon, multi-objective sequence generation tasks.
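The idea of prioritizing tasks by learning progress can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes TSCL refers to Teacher-Student Curriculum Learning, treats each candidate maximum sequence length as one task, and estimates progress as an exponential moving average of score changes. The class name `LengthCurriculum` and the parameters `alpha` and `epsilon` are hypothetical.

```python
import random


class LengthCurriculum:
    """Hypothetical TSCL-style teacher over maximum-length tasks (illustrative only)."""

    def __init__(self, lengths, alpha=0.3, epsilon=0.1, seed=0):
        self.lengths = list(lengths)   # candidate maximum sequence lengths (the tasks)
        self.alpha = alpha             # smoothing factor for the progress estimate
        self.epsilon = epsilon         # probability of picking a task uniformly at random
        self.rng = random.Random(seed)
        self.last_score = {length: None for length in self.lengths}
        self.progress = {length: 0.0 for length in self.lengths}

    def sample(self):
        """Pick a task with probability proportional to its absolute learning progress."""
        weights = [abs(self.progress[length]) for length in self.lengths]
        if self.rng.random() < self.epsilon or sum(weights) == 0.0:
            # Explore uniformly, or fall back to uniform when no progress is recorded yet.
            return self.rng.choice(self.lengths)
        return self.rng.choices(self.lengths, weights=weights, k=1)[0]

    def update(self, length, score):
        """Record a new evaluation score for a task and refresh its progress estimate."""
        previous = self.last_score[length]
        if previous is not None:
            delta = score - previous
            self.progress[length] = (
                (1 - self.alpha) * self.progress[length] + self.alpha * delta
            )
        self.last_score[length] = score
```

In a training loop, the student (here, the GFlowNet) would train on the sampled length for a few steps, report a score such as mean reward, and call `update`. Using the absolute value of progress means that tasks whose scores are dropping are also revisited.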
Abstract
Designing mRNA sequences is a major challenge in developing next-generation therapeutics, since it involves exploring a vast space of possible nucleotide combinations while optimizing sequence properties such as stability, translation efficiency, and protein expression. While Generative Flow Networks (GFlowNets) are promising for this task, their training is hindered by sparse, long-horizon rewards and multi-objective trade-offs. We propose Curriculum-Augmented GFlowNets (CAGFN), which integrate curriculum learning with multi-objective GFlowNets to generate de novo mRNA sequences. CAGFN uses a length-based curriculum that progressively adapts the maximum sequence length, guiding exploration from easier to harder subproblems. We also provide a new mRNA design environment for GFlowNets: given a target protein sequence and a combination of biological objectives, it supports training models that generate plausible mRNA candidates. This provides a biologically motivated setting for applying and advancing GFlowNets in therapeutic sequence design. On different mRNA design tasks, CAGFN improves Pareto performance and biological plausibility while maintaining diversity. Moreover, CAGFN reaches higher-quality solutions faster than a GFlowNet trained with random sequence sampling (no curriculum), and it generalizes to out-of-distribution sequences.
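To make the design setting concrete, the sketch below builds an mRNA sequence one codon at a time, restricting each step to codons that are synonymous for the next amino acid of the target protein. It is a toy illustration under stated assumptions, not the paper's environment: the class `ToyCodonEnv` and its methods are hypothetical, the codon table covers only eight amino acids of the standard genetic code, and `gc_content` is an illustrative stand-in objective rather than one taken from the paper.

```python
# Partial standard genetic code (RNA alphabet); only a few amino acids are included.
SYNONYMOUS_CODONS = {
    "M": ["AUG"],
    "W": ["UGG"],
    "K": ["AAA", "AAG"],
    "F": ["UUU", "UUC"],
    "D": ["GAU", "GAC"],
    "E": ["GAA", "GAG"],
    "G": ["GGU", "GGC", "GGA", "GGG"],
    "A": ["GCU", "GCC", "GCA", "GCG"],
}


class ToyCodonEnv:
    """Hypothetical codon-by-codon construction environment (illustrative only)."""

    def __init__(self, protein):
        unknown = set(protein) - set(SYNONYMOUS_CODONS)
        if unknown:
            raise ValueError(f"amino acids missing from the toy table: {sorted(unknown)}")
        self.protein = protein
        self.codons = []

    def reset(self):
        """Start a new, empty sequence."""
        self.codons = []
        return tuple(self.codons)

    def done(self):
        """True once one codon has been chosen for every amino acid."""
        return len(self.codons) == len(self.protein)

    def valid_actions(self):
        """Codons that encode the next amino acid of the target protein."""
        if self.done():
            return []
        return SYNONYMOUS_CODONS[self.protein[len(self.codons)]]

    def step(self, codon):
        """Append a synonymous codon and report whether the sequence is complete."""
        if codon not in self.valid_actions():
            raise ValueError(f"{codon} does not encode the next amino acid")
        self.codons.append(codon)
        return tuple(self.codons), self.done()

    def sequence(self):
        """The mRNA coding sequence built so far."""
        return "".join(self.codons)


def gc_content(seq):
    """Fraction of G and C nucleotides; an illustrative objective only."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0
```

For the target protein `MKG`, this toy table allows 1 × 2 × 4 = 8 synonymous sequences. The number of candidates grows multiplicatively with protein length, which is why long targets give a large search space and why a curriculum over maximum length is a natural way to order subproblems.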
