
Helix: A Dual-Helix Co-Evolutionary Multi-Agent System for Prompt Optimization and Question Reformulation

Kewen Zhu, Liping Yi, Zhiming Zhao, Xiang Li, Qinghua Hu

Abstract

Automated prompt optimization (APO) aims to improve large language model performance by refining prompt instructions. However, existing methods are largely constrained by fixed prompt templates, limited search spaces, or single-sided optimization that treats user questions as immutable inputs. In practice, question formulation and prompt design are inherently interdependent: clearer question structures facilitate focused reasoning and task understanding, while effective prompts reveal better ways to organize and restate queries. Ignoring this coupling fundamentally limits the effectiveness and adaptability of current APO approaches. We propose a unified multi-agent system (Helix) that jointly optimizes question reformulation and prompt instructions through a structured three-stage co-evolutionary framework. Helix integrates (1) planner-guided decomposition that breaks optimization into coupled question-prompt objectives, (2) dual-track co-evolution where specialized agents iteratively refine and critique each other to produce complementary improvements, and (3) strategy-driven question generation that instantiates high-quality reformulations for robust inference. Extensive experiments on 12 benchmarks against 6 strong baselines demonstrate the effectiveness of Helix, achieving up to 3.95% performance improvements across tasks with favorable optimization efficiency.
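The three stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function name, the `Objective` structure, the accept-only-if-validated rule, and the fall-back to the original question are assumptions made for illustration, and the agents are stubbed as plain functions where a real system would presumably call an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Objective:
    """One coupled question/prompt objective (hypothetical structure)."""
    question_goal: str
    prompt_goal: str


def plan(task: str) -> List[Objective]:
    # Stage 1 stub: a real planner would derive objectives from the task.
    return [
        Objective("state the question explicitly", "ask for step-by-step reasoning"),
        Objective("list the answer options", "ask for a single final answer"),
    ]


def co_evolve(
    prompt: str,
    strategy: List[str],
    objectives: List[Objective],
    refine_prompt: Callable[[str, List[str], Objective], str],
    refine_strategy: Callable[[List[str], str, Objective], List[str]],
    validate: Callable[[str, List[str]], bool],
) -> Tuple[str, List[str]]:
    # Stage 2: alternate the two tracks, each seeing the other's latest output.
    # Assumption: a candidate pair is kept only if the validator accepts it.
    for obj in objectives:
        cand_prompt = refine_prompt(prompt, strategy, obj)
        cand_strategy = refine_strategy(strategy, cand_prompt, obj)
        if validate(cand_prompt, cand_strategy):
            prompt, strategy = cand_prompt, cand_strategy
    return prompt, strategy


def reformulate(
    question: str,
    strategy: List[str],
    generate: Callable[[str, List[str]], str],
    judge: Callable[[str, str], bool],
) -> str:
    # Stage 3: instantiate a reformulated question from the learned strategy.
    # Assumption: fall back to the original question if the judge rejects it.
    candidate = generate(question, strategy)
    return candidate if judge(question, candidate) else question


# Deterministic stand-ins for the LLM-based agents (purely illustrative).
def stub_refine_prompt(prompt: str, strategy: List[str], obj: Objective) -> str:
    return f"{prompt} {obj.prompt_goal}.".strip()


def stub_refine_strategy(strategy: List[str], prompt: str, obj: Objective) -> List[str]:
    return strategy + [obj.question_goal]


def stub_validate(prompt: str, strategy: List[str]) -> bool:
    return bool(prompt) and bool(strategy)


def stub_generate(question: str, strategy: List[str]) -> str:
    return f"{question} [{'; '.join(strategy)}]"


def stub_judge(original: str, candidate: str) -> bool:
    return original in candidate


final_prompt, final_strategy = co_evolve(
    "", [], plan("demo task"),
    stub_refine_prompt, stub_refine_strategy, stub_validate,
)
refined_question = reformulate("Who does 'she' refer to?", final_strategy,
                               stub_generate, stub_judge)
```

In this sketch the optimized prompt and the reformulation strategy are produced together during optimization, and only the strategy is applied to each new question at inference time, mirroring the abstract's split between co-evolution and strategy-driven question generation.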

Paper Structure (38 sections, 12 equations, 10 figures, 56 tables, 1 algorithm)

Figures (10)

  • Figure 1: Comparison of three strategies for pronoun disambiguation. Left: original question without prompt instructions yields an incorrect answer. Middle: adding a CoT prompt to the original question still fails. Right: Helix jointly optimizes question formulation and prompt instructions, producing the correct prediction.
  • Figure 2: Overview of the Helix framework, which uses six LLM-based agents to jointly optimize question reformulation and prompt instructions. The ① Planner decomposes the task into a sequence of helix objectives. Dual-helix co-evolution then alternates between the ② Prompt-Architect and the ③ Question-Architect under ④ Mediator validation. Finally, the ⑤ Question-Generator and the ⑥ Question-Judge produce validated refined questions, which are paired with the optimized prompt and fed to the target LLM for inference.
  • Figure 3: Accuracy–cost trade-off across 12 tasks for different prompt optimization methods. Bubble size denotes the number of training samples. Helix achieves the highest accuracy with the fewest API calls while using only a single sample.
  • Figure 4: Prompt efficiency (PE) comparison on four representative BBH tasks, where Helix consistently achieves the highest performance per optimization cost.
  • Figure 5: Training-stage dual-helix co-evolution on the Formal Fallacies task. The Planner decomposes the task into sequential helix objectives with coupled question reformulation and prompt refinement goals. Prompt-Architect and Question-Architect iteratively critique and refine each other under Mediator validation, yielding aligned optimized prompts and transferable question reformulation strategies.
  • ...and 5 more figures