Helix: Evolutionary Reinforcement Learning for Open-Ended Scientific Problem Solving

Chang Su; Zhongkai Hao; Zhizhou Zhang; Zeyu Xia; Youjia Wu; Hang Su; Jun Zhu

TL;DR

HELIX introduces two key novelties: a diverse yet high-quality pool of candidate solutions that broadens exploration through in-context learning, and reinforcement learning for iterative policy refinement that progressively elevates solution quality.

Abstract

Large language models (LLMs) with reasoning abilities have demonstrated growing promise for tackling complex scientific problems. Yet such tasks are inherently domain-specific, unbounded and open-ended, demanding exploration across vast and flexible solution spaces. Existing approaches, whether purely learning-based or reliant on carefully designed workflows, often suffer from limited exploration efficiency and poor generalization. To overcome these challenges, we present HELIX, a Hierarchical Evolutionary reinforcement Learning framework with In-context eXperiences. HELIX introduces two key novelties: (i) a diverse yet high-quality pool of candidate solutions that broadens exploration through in-context learning, and (ii) reinforcement learning for iterative policy refinement that progressively elevates solution quality. This synergy enables the discovery of more advanced solutions. On the circle packing task, HELIX achieves a state-of-the-art result with a sum of radii of 2.63598308 using only a 14B model. Across standard machine learning benchmarks, HELIX further surpasses GPT-4o with a carefully engineered pipeline, delivering an average F1 improvement of 5.95 points on the Adult and Bank Marketing datasets.

Paper Structure (82 sections, 2 theorems, 45 equations, 22 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Reinforcement learning of LLMs.
Evolutionary algorithms.
Proposed Method
Overview
Policy Optimization Aligned with Evolutionary Search
Evolutionary Mechanism for Balancing Quality and Diversity
Diversity measurement.
NSGA-II based selection.
Experiment
Experiment Setting
Tasks.
Models.
Baselines.
...and 67 more sections

Key Result

Theorem F.3

Let $v_0$ be the initial solution for GRPO. GRPO will converge to the local optimum near $v_{loc}$ if the following conditions are met:

Figures (22)

Figure 1: The figure demonstrates how our framework progressively discovers new insights and refines solutions over iterations. (a): Reward curve for the housing dataset optimization, where improvements are achieved through iterative adoption of better models, parameter tuning, and feature engineering, with the final reward of 1.758 corresponding to an RMSE of 1.747. (b): Reward curves for the beam and inductor design tasks, where the algorithm explores novel geometries and combines favorable structural features to enhance performance.
Figure 2: Illustration of the HELIX framework. The workflow begins with a dataset containing task descriptions and a pool of initial solutions, which the LLM takes as input. The LLM modifies the original solution to generate a new one, represented as a descendant in the lineage tree. After the evaluation pipeline, the resulting reward-labeled solutions are used to update the policy parameters via reinforcement learning. The same samples are also filtered by the NSGA-II algorithm to construct a promising yet diverse set of candidate solutions for population evolution.
Figure 3: Convergence analysis on the Inductor and Adult tasks. The curves show the progressive improvement of average reward and validity during training, demonstrating that our framework effectively leverages reinforcement learning feedback and evolutionary dynamics to produce increasingly valid and high-quality solutions.
Figure 4: Ablation analysis of framework components. (a): Maximum reward achieved by different ablation variants. (b): Curve of epoch-wise maximum reward on the Circle Packing task, highlighting the critical role of balancing diversity and quality for stable optimization. (c): Curve of epoch-wise maximum reward on the Boston Housing task, showing the necessity of combining reinforcement learning with evolutionary guidance.
Figure 5: Scaling analysis over model parameter count on (a): magnetic torque maximization and (b): inductor design tasks.
...and 17 more figures
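
The Figure 2 caption describes NSGA-II selecting candidates that are both promising and diverse. As a rough illustration only, the sketch below shows the non-dominated sorting idea behind that step on two maximized objectives, reward and diversity. The candidate tuples, the diversity scores, and the function names are hypothetical and not taken from the paper. Full NSGA-II also breaks ties inside a front by crowding distance, which this sketch replaces with a simple sort by reward.

```python
from typing import List, Tuple

# Hypothetical candidate record: (name, reward, diversity); both objectives are maximized.
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    return a[1] >= b[1] and a[2] >= b[2] and (a[1] > b[1] or a[2] > b[2])


def non_dominated_fronts(pool: List[Candidate]) -> List[List[Candidate]]:
    """Peel off successive Pareto fronts (the sorting step of NSGA-II)."""
    remaining = list(pool)
    fronts = []
    while remaining:
        # A candidate is in the current front if nothing remaining dominates it.
        front = [c for c in remaining if not any(dominates(o, c) for o in remaining)]
        fronts.append(front)
        remaining = [c for c in remaining if c not in front]
    return fronts


def select(pool: List[Candidate], k: int) -> List[Candidate]:
    """Fill the next population front by front, up to k candidates.

    Simplification: ties inside a front are broken by reward,
    not by the crowding distance used in full NSGA-II.
    """
    chosen: List[Candidate] = []
    for front in non_dominated_fronts(pool):
        for cand in sorted(front, key=lambda c: c[1], reverse=True):
            if len(chosen) < k:
                chosen.append(cand)
    return chosen


# Made-up scores purely for illustration.
pool = [
    ("a", 0.90, 0.10),
    ("b", 0.70, 0.60),
    ("c", 0.40, 0.90),
    ("d", 0.60, 0.50),  # dominated by b
    ("e", 0.30, 0.20),  # dominated by b and d
]
selected = select(pool, 4)
print([name for name, _, _ in selected])
```

Here candidates a, b and c form the first front because none of them is beaten on both objectives at once, so a low-reward but highly diverse solution such as c survives selection ahead of d.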

Theorems & Definitions (3)

Definition F.1: LLM Transition Process
Theorem F.3: Convergence to Local Optimum of GRPO
Theorem F.4: Stationary Distribution
