Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors

Fan Nie, Lan Feng, Haotian Ye, Weixin Liang, Pan Lu, Huaxiu Yao, Alexandre Alahi, James Zou

TL;DR

The paper tackles the challenge of leveraging powerful, often expensive-to-fine-tune LLMs by introducing Weak-for-Strong Harnessing (W4S), which trains a compact weak meta-agent to design agentic workflows that utilize strong executors. Framed as a multi-turn Markov decision process (MDP) and optimized via Reinforcement Learning for Agentic Workflow Optimization (RLAO), W4S enables a 7B meta-agent to generate and refine workflows with environment feedback, without touching the strong models directly. Empirical results across eleven benchmarks show substantial gains (2.9% to 24.6%) over baselines, with robust generalization to unseen tasks and cross-model transfer, while training costs remain modest (about one GPU hour) and test-time costs are low. The approach offers a scalable, controllable alternative to fine-tuning, unlocking latent capabilities of strong executors through learned, task-specific workflow design and coordination.

Abstract

Efficiently leveraging the capabilities of contemporary large language models (LLMs) is increasingly challenging, particularly when direct fine-tuning is expensive and often impractical. Existing training-free methods, including manually or automatically designed workflows, typically demand substantial human effort or yield suboptimal results. This paper proposes Weak-for-Strong Harnessing (W4S), a novel framework that customizes smaller, cost-efficient language models to design and optimize workflows for harnessing stronger models. W4S formulates workflow design as a multi-turn Markov decision process and introduces reinforcement learning for agentic workflow optimization (RLAO) to train a weak meta-agent. Through iterative interaction with the environment, the meta-agent learns to design increasingly effective workflows without manual intervention. Empirical results demonstrate the superiority of W4S: our 7B meta-agent, trained with just one GPU hour, outperforms the strongest baseline by 2.9% to 24.6% across eleven benchmarks, successfully elevating the performance of state-of-the-art models such as GPT-3.5-Turbo and GPT-4o. Notably, W4S exhibits strong generalization capabilities across both seen and unseen tasks, offering an efficient, high-performing alternative to directly fine-tuning strong models.

Paper Structure

This paper contains 26 sections, 10 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison of paradigms: Weak-to-Strong Generalization uses weak models to supervise strong models, akin to superalignment; routing-based methods train weak models to dispatch queries across strong models; in contrast, Weak-for-Strong Harnessing (W4S) trains a weak model to optimize a strong model’s performance on a specific task.
  • Figure 2: (a) The weak meta-agent harnesses strong models by iteratively optimizing workflows based on task and environment feedback. (b) To collect effective data for offline RL training, the meta-agent samples $m$ times in each iteration and uses the best samples to form the next state. The data form multi-turn trajectories for offline RL training.
  • Figure 3: Ablation studies on MGSM and GSM8K. The purple line represents the performance of W4S using a 7B model trained on MGSM and GSM Plus with RLAO.
  • Figure 4: Cost Analysis (a) and Case Studies (b, c) of W4S on different benchmarks.
  • Figure 5: Test accuracy (%) of ADAS on the MGSM dataset. 'Sequential' denotes the default configuration, which updates the history archive iteratively; 'Random' indicates 30 independent workflow samples generated in the first iteration. Results show that ADAS’s sequential performance closely mirrors random sampling, with its maximum accuracy not exceeding the best random sample.
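
The data-collection loop described in the Figure 2(b) caption (sample $m$ candidate workflows per iteration, carry the best forward into the next state) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the names `Turn`, `collect_trajectory`, `toy_propose`, and `toy_evaluate` are hypothetical, the meta-agent and strong executor are replaced by toy stand-ins, and keeping all $m$ samples per turn (rather than only the best) is an assumption.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    state: Tuple[str, ...]  # summaries of earlier (workflow, feedback) pairs
    action: str             # workflow proposed at this turn
    reward: float           # score returned by the environment


def collect_trajectory(
    propose: Callable[[Tuple[str, ...], random.Random], str],
    evaluate: Callable[[str], float],
    num_turns: int,
    m: int,
    seed: int = 0,
) -> List[List[Turn]]:
    """Best-of-m multi-turn collection (assumed form): at each turn, sample m
    candidate workflows, record them, and extend the state with the best one."""
    rng = random.Random(seed)
    state: Tuple[str, ...] = ()
    trajectory: List[List[Turn]] = []
    for _ in range(num_turns):
        candidates = [propose(state, rng) for _ in range(m)]
        scored = [Turn(state, c, evaluate(c)) for c in candidates]
        trajectory.append(scored)
        best = max(scored, key=lambda t: t.reward)
        # The next state sees the best workflow and its feedback.
        state = state + (f"{best.action} -> {best.reward:.2f}",)
    return trajectory


def toy_propose(state: Tuple[str, ...], rng: random.Random) -> str:
    # Hypothetical stand-in for the weak meta-agent: the "workflow" is
    # just a number of votes chosen at random.
    return f"votes={rng.randint(1, 9)}"


def toy_evaluate(workflow: str) -> float:
    # Hypothetical stand-in for running a strong executor on validation data.
    votes = int(workflow.split("=")[1])
    return votes / 10.0
```

Calling `collect_trajectory(toy_propose, toy_evaluate, num_turns=3, m=4)` returns three turns of four scored candidates each; in a real setup, `propose` would query the meta-agent and `evaluate` would execute the generated workflow against the strong model.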