MultiGA: Leveraging Multi-Source Seeding in Genetic Algorithms

Isabelle Diana May-Xin Ng, Tharindu Cyril Weerasooriya, Haitao Zhu, Wei Wei

TL;DR

MultiGA introduces a genetic algorithm framework that seeds the initial population with outputs from multiple LLMs and relies on an independent evaluator to score and recombine candidates. This ensemble-based seeding mitigates reliance on any single model and promotes diversity, enabling robust performance across text-to-SQL, planning, graduate science questions, and bias evaluation. The approach demonstrates convergence toward the best-performing single model's accuracy while maintaining stability and resilience to weaker seeds. The study highlights the potential of cross-LLM collaboration and evaluator-driven recombination as a practical direction for tackling interdisciplinary and novel tasks without heavy model selection overhead.

Abstract

Large Language Models (LLMs) are widely used across research domains to tackle complex tasks, but their performance can vary significantly depending on the task at hand. Evolutionary algorithms, inspired by natural selection, can be used to refine solutions iteratively at inference time. To the best of our knowledge, no prior work has explored leveraging the collective capabilities of multiple LLMs through multi-source seeding in LLM-guided genetic algorithms. In this paper, we introduce a novel approach, MultiGA, which applies genetic algorithm principles to complex natural language tasks and reasoning problems by sampling from a diverse set of LLMs to initialize the population. MultiGA generates a range of outputs from various parent LLMs, both open-source and closed-source, and uses a neutral fitness function to evaluate them. Through an iterative recombination process, we mix and refine these generations until a candidate reaches the target fitness or the generation limit is reached. We benchmark our approach on text-to-SQL code generation, trip planning, the GPQA benchmark of graduate-level science questions, and the BBQ bias benchmark. Our results show that MultiGA converges to the accuracy of the LLM best suited to the task. These insights lay the foundation for future research on integrating multiple LLMs for unexplored tasks where the choice of a single pre-trained model is unclear or suboptimal.

Paper Structure

This paper contains 40 sections, 2 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of the MultiGA framework. Populations are initialized with multiple LLMs, while an independent LLM $E$ handles fitness evaluation (scoring candidates) and recombination (combining two parent solutions). The process terminates once the target fitness $\phi$ or the maximum number of generations $T$ is reached, and the top candidate solution is returned.
  • Figure 2: Example of recombination on a text-to-SQL task (“Which publisher published the slowest superhero?”). Parent 1 (gpt-4o, score 0.95) was paired with a randomly selected parent (deepseek-r1, score 0.1). The resulting child achieved a perfect score of 1.0 by preserving gpt-4o’s overall structure while incorporating the MIN aggregation from deepseek-r1.
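
The loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `multiga_sketch`, the callables `seeders`, `score`, and `recombine`, and the toy keyword task are all hypothetical stand-ins. In the paper, seeding is done by multiple LLMs and both scoring and recombination are handled by the independent evaluator LLM $E$. The selection scheme (each candidate in the top half is paired with a randomly chosen mate, and the population is truncated back to its original size) is an assumption loosely modeled on the random pairing mentioned in the Figure 2 caption.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-ins for LLM calls (not part of the paper's code):
Seeder = Callable[[str], str]              # task -> candidate solution
Scorer = Callable[[str, str], float]       # (task, candidate) -> fitness in [0, 1]
Combiner = Callable[[str, str, str], str]  # (task, parent_a, parent_b) -> child


def multiga_sketch(
    task: str,
    seeders: List[Seeder],
    score: Scorer,
    recombine: Combiner,
    phi: float = 1.0,   # target fitness
    T: int = 5,         # maximum number of generations
    rng: Optional[random.Random] = None,
) -> Tuple[float, str]:
    """Return (fitness, candidate) for the best solution found."""
    if not seeders:
        raise ValueError("at least one seeding model is required")
    rng = rng or random.Random(0)

    # Multi-source seeding: one candidate per source model, scored by the evaluator.
    candidates = [seed(task) for seed in seeders]
    population = sorted(
        ((score(task, c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    size = len(population)

    for _ in range(T):
        if population[0][0] >= phi:  # target fitness reached
            break
        children = []
        # Assumed selection scheme: pair each top-half candidate with a random mate.
        for i in range(max(1, size // 2)):
            mates = [c for j, (_, c) in enumerate(population) if j != i]
            if not mates:
                break
            child = recombine(task, population[i][1], rng.choice(mates))
            children.append((score(task, child), child))
        # Keep the fittest candidates, holding the population size fixed.
        population = sorted(
            population + children, key=lambda pair: pair[0], reverse=True
        )[:size]

    return population[0]


# Toy task: fitness is the fraction of required keywords present in the candidate.
REQUIRED = {"SELECT", "JOIN", "MIN"}


def toy_score(task: str, candidate: str) -> float:
    return len(REQUIRED & set(candidate.split())) / len(REQUIRED)


def toy_recombine(task: str, a: str, b: str) -> str:
    # Keep parent a's tokens and append any tokens from b that a lacks.
    a_tokens = a.split()
    return " ".join(a_tokens + [t for t in b.split() if t not in a_tokens])


toy_seeders = [lambda task: "SELECT JOIN", lambda task: "MIN"]
best_fitness, best_candidate = multiga_sketch(
    "toy", toy_seeders, toy_score, toy_recombine
)
```

In this toy run, the stronger seed contributes its overall structure and the weaker seed contributes the one keyword it got right, which loosely mirrors the Figure 2 example where a low-scoring parent still supplies the MIN aggregation.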