MultiGA: Leveraging Multi-Source Seeding in Genetic Algorithms
Isabelle Diana May-Xin Ng, Tharindu Cyril Weerasooriya, Haitao Zhu, Wei Wei
TL;DR
MultiGA introduces a genetic algorithm framework that seeds the initial population with outputs from multiple LLMs and relies on an independent evaluator to score and recombine candidates. This ensemble-based seeding mitigates reliance on any single model and promotes diversity, enabling robust performance across text-to-SQL, trip planning, graduate-level science questions, and bias evaluation. The approach converges toward the accuracy of the best-performing single model while remaining stable and resilient to weaker seeds. The study highlights cross-LLM collaboration and evaluator-driven recombination as a practical direction for tackling interdisciplinary and novel tasks without heavy model-selection overhead.
Abstract
Large Language Models (LLMs) are widely used across research domains to tackle complex tasks, but their performance can vary significantly depending on the task at hand. Evolutionary algorithms, inspired by natural selection, can be used to refine solutions iteratively at inference time. To the best of our knowledge, no prior work has explored multi-source seeding for LLM-guided genetic algorithms. In this paper, we introduce MultiGA, a novel approach that applies genetic algorithm principles to complex natural language tasks and reasoning problems by sampling outputs from a diverse set of LLMs to initialize the population. MultiGA generates a range of outputs from various parent LLMs, both open-source and closed-source, and uses a neutral fitness function to evaluate them. Through an iterative recombination process, we mix and refine these generations until an optimal solution is achieved. We benchmark our approach on text-to-SQL code generation, trip planning, the GPQA benchmark of graduate-level science questions, and the BBQ bias benchmark. Our results show that MultiGA converges to the accuracy of the LLM best suited to the task. These insights lay the foundation for future research on integrating multiple LLMs for unexplored tasks in which the choice of a single pre-trained model is unclear or suboptimal.
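The loop described in the abstract (multi-source seeding, evaluation by a neutral fitness function, iterative recombination) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function name `multi_source_ga`, the top-half parent selection, the elitism step, the early-stop threshold, and the toy seeders, fitness function, and recombiner that stand in for real LLM calls.

```python
import random
from typing import Callable, List, Optional, Sequence

Seeder = Callable[[str], str]                # stand-in for one seed LLM
Fitness = Callable[[str, str], float]        # stand-in for the independent evaluator
Recombiner = Callable[[str, str, str], str]  # stand-in for LLM-driven recombination


def multi_source_ga(
    task: str,
    seeders: Sequence[Seeder],
    fitness: Fitness,
    recombine: Recombiner,
    generations: int = 3,
    children_per_generation: int = 4,
    target_fitness: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the best candidate found (hypothetical sketch, not the paper's code)."""
    if len(seeders) < 2:
        raise ValueError("multi-source seeding needs at least two seeders")
    rng = rng or random.Random(0)

    # Multi-source seeding: one initial candidate per source model.
    population: List[str] = [seed(task) for seed in seeders]
    best = max(population, key=lambda c: fitness(task, c))

    for _ in range(generations):
        if fitness(task, best) >= target_fitness:
            break  # assumed stopping rule: good enough, stop early
        ranked = sorted(population, key=lambda c: fitness(task, c), reverse=True)
        # Assumed selection scheme: keep the top half (at least two) as parents.
        parents = ranked[: max(2, len(ranked) // 2)]
        children = [
            recombine(task, *rng.sample(parents, 2))
            for _ in range(children_per_generation)
        ]
        # Elitism (an assumption): carry the current best into the next generation.
        population = [ranked[0]] + children
        best = max(population, key=lambda c: fitness(task, c))
    return best


# Toy stand-ins so the sketch runs without any model access.
KEYWORDS = ("SELECT", "FROM", "WHERE")


def toy_fitness(task: str, candidate: str) -> float:
    """Fraction of required SQL keywords present in the candidate."""
    tokens = candidate.upper().split()
    return sum(k in tokens for k in KEYWORDS) / len(KEYWORDS)


def toy_recombine(task: str, a: str, b: str) -> str:
    """Merge two candidates: tokens of `a`, then tokens of `b` not already present."""
    tokens = a.split()
    tokens += [t for t in b.split() if t not in tokens]
    return " ".join(tokens)


SEEDERS: List[Seeder] = [
    lambda task: "SELECT name FROM users",  # strong seed
    lambda task: "WHERE age > 30",          # partial seed
    lambda task: "name users",              # weak seed
]
```

Calling `multi_source_ga("list adult users", SEEDERS, toy_fitness, toy_recombine)` combines the strong and partial seeds into a candidate that contains all three keywords. Because the current best candidate is always carried forward, the returned candidate never scores below the best seed, which loosely mirrors the resilience to weaker seeds described above. A real system would replace the toy callables with LLM calls and a task-specific evaluator.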
