Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization

Yuchi Liu, Jaskirat Singh, Gaowen Liu, Ali Payani, Liang Zheng

TL;DR

The paper tackles the problem that prompt design quality limits LLM performance and existing prompt optimization methods struggle to generalize without task-specific training. It introduces Hierarchical Multi-Agent Workflow (HMAW), a zero-shot, training-free pipeline where a CEO, Manager, and Worker LLM collaboratively generate a refined prompt and the final answer, leveraging layer-specific contexts and skip connections. Empirical results across five diverse datasets show HMAW improves over no prompting and performs competitively with or surpasses state-of-the-art methods, achieving an average 69.2% preference for HMAW over baselines and 70.3% GSM8K accuracy (+1.7). The approach demonstrates strong generalization across tasks and LLMs, offering a scalable, task-agnostic solution for prompt optimization with practical impact in multi-domain AI systems.

Abstract

Large language models (LLMs) have shown great progress in responding to user questions, allowing for a multitude of diverse applications. Yet, the quality of LLM outputs heavily depends on the prompt design, where a good prompt might enable the LLM to answer a very challenging question correctly. Therefore, recent works have developed many strategies for improving the prompt, including both manual crafting and in-domain optimization. However, their efficacy in unrestricted scenarios remains questionable, as the former depends on human design for specific questions and the latter usually generalizes poorly to unseen scenarios. To address these problems, we give LLMs the freedom to design the best prompts according to themselves. Specifically, we include a hierarchy of LLMs, first constructing a prompt with precise instructions and accurate wording in a hierarchical manner, and then using this prompt to generate the final answer to the user query. We term this pipeline Hierarchical Multi-Agent Workflow, or HMAW. In contrast with prior works, HMAW imposes no human restriction and requires no training, and is completely task-agnostic while capable of adjusting to the nuances of the underlying task. Through both quantitative and qualitative experiments across multiple benchmarks, we verify that despite its simplicity, the proposed approach can create detailed and suitable prompts, further boosting the performance of current LLMs.

Paper Structure

This paper contains 18 sections, 1 equation, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Examples comparing the generalization ability of existing methods and the proposed one. (a) CoT [wei2022chain] uses a handcrafted prompt, which might not be suitable for all tasks. (b) APE [zhou2023large] fine-tunes the prompt on a specific dataset, and its generalization capability to other scenarios is questionable. (c) ExpertPrompting [xu2023expertprompting] includes few-shot examples in the system prompt to help an LLM convert the user query to a format more suitable for LLMs, but these examples might not cover all scenarios. (d) Our method adopts a hierarchical design in reformatting the user query. Free from pre-defined few-shot examples, the interaction between the layers of the hierarchy allows for more generalizable yet more adaptive tuning of prompts.
  • Figure 2: Method Overview. We propose modeling the prompt optimization problem as a zero-shot output within a multi-agent workflow. The initial query, $q_i$, is first fed into the first layer of our framework (the CEO layer). Before being processed by the CEO LLM agent, $q_i$ is transformed into an LLM prompt $p_i^c$ by the prompter $f^c$, which also concatenates it with the CEO-layer context $C^c$. The output of the first layer, $q_i^c$, serves as the query from the CEO layer to the Manager layer. Similarly, the Manager layer and the Worker layer each include their own prompters, $f^m$ and $f^w$, respectively. Besides concatenating the context of their own layer, these prompters also concatenate the initial query $q_i$ (a skip connection) to enhance stability. The input for the Worker LLM is our optimized prompt $P_i^*$, which directly triggers the LLM agent to generate the final response to the original query $q_i$.
  • Figure 3: Ablation studies. We remove various components from the full system: the two skip connections, the CEO layer, and the Manager layer. Preference score (%) is reported on three datasets: Education (left), CodeNet (middle), and FED (right). Removing these components one or two at a time leads to a performance drop.
  • Figure 4: Impact of the number of workflow layers. The preference scores for responses generated using HMAW with different numbers of layers (1-6) are reported on the Education dataset. GPT-3.5 is used as the LLM agent. The three-layer workflow (CEO-Manager-Worker) yields the highest preference score.
  • Figure 5: An example of prompt optimization using HMAW on the Education dataset. A student poses a question about genetics. The CEO creates an instruction for the Manager, who then generates instructions for the Worker. Based on these, the Worker delivers the final response. Colored text indicates content coherence: green highlights the user background, purple focuses on the topic of the task, orange indicates the proposed solution, red highlights solution details, and blue emphasizes the tone of the response. These layers and cross-layer correspondences demonstrate how HMAW effectively optimizes the prompt and responds to specific user needs. Each step adds specificity and applicability, allowing for accurate and intuitive LLM responses.
  • ...and 8 more figures
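
The workflow described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the context strings, the `prompter` helper, the `hmaw` function name, and the `LLM` callable type are all hypothetical, and the exact way contexts and queries are concatenated is a guess based on the caption. The sketch only shows the data flow: CEO output feeds the Manager, Manager output feeds the Worker, and the original query is re-attached at the Manager and Worker layers as a skip connection.

```python
from typing import Callable, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]

# Hypothetical layer contexts (C^c, C^m, C^w); the paper's wording is not shown here.
CEO_CONTEXT = "You are the CEO. Write guidelines for your Manager on handling the user query."
MANAGER_CONTEXT = "You are the Manager. Turn the CEO's guidelines into instructions for the Worker."
WORKER_CONTEXT = "You are the Worker. Follow the Manager's instructions and answer the user query."


def prompter(context: str, *parts: str) -> str:
    """Assumed form of f^c, f^m, f^w: join the layer context with the given parts."""
    return "\n\n".join([context, *parts])


def hmaw(query: str, llm: LLM) -> Tuple[str, str]:
    """Run a three-layer CEO -> Manager -> Worker pass; return (optimized prompt, answer)."""
    # CEO layer: prompt p_i^c built from the context and the initial query q_i.
    ceo_prompt = prompter(CEO_CONTEXT, f"User query: {query}")
    ceo_output = llm(ceo_prompt)  # q_i^c

    # Manager layer: CEO output plus the initial query (first skip connection).
    manager_prompt = prompter(
        MANAGER_CONTEXT, f"CEO guidelines: {ceo_output}", f"User query: {query}"
    )
    manager_output = llm(manager_prompt)

    # Worker layer: Manager output plus the initial query (second skip connection).
    # This string plays the role of the optimized prompt P_i^*.
    optimized_prompt = prompter(
        WORKER_CONTEXT, f"Manager instructions: {manager_output}", f"User query: {query}"
    )
    answer = llm(optimized_prompt)
    return optimized_prompt, answer


if __name__ == "__main__":
    # Stand-in model for a dry run; replace with a call to a real LLM client.
    def stub_llm(prompt: str) -> str:
        return f"<response to {len(prompt)} characters of prompt>"

    prompt, answer = hmaw("How are traits inherited from parents?", stub_llm)
    print(prompt)
    print(answer)
```

Because the same `llm` callable is passed through every layer, no training or task-specific examples are involved; dropping the `User query` part from the Manager or Worker prompt would correspond to removing a skip connection, as in the ablation of Figure 3.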