Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs

Changhao Li, Yuchen Zhuang, Rushi Qiang, Haotian Sun, Hanjun Dai, Chao Zhang, Bo Dai

TL;DR

Matryoshka Pilot (M-Pilot) introduces a lightweight white-box LLM controller that drives a black-box LLM by generating intermediate guidance and treating the black-box generator as an interactive environment for multi-turn problem solving. It formalizes the setup as a Markov decision process (MDP) and uses Iterative Direct Preference Optimization (IDPO) with a Bradley–Terry preference model and KL-regularized planning to continually improve guidance from feedback, without accessing the black-box parameters. Across personalization (LaMP), reasoning (GSM8K), and planning (ALFWorld), M-Pilot delivers consistent improvements over strong baselines and supports plug-and-play deployment with different black-box models, demonstrating data efficiency and transferability. The work highlights the potential of a transparent, scalable controller–environment framework to enhance long-horizon tasks in black-box LLMs while acknowledging societal and safety considerations.
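As a rough illustration of the preference-optimization step mentioned above, the sketch below computes the standard DPO pairwise loss, which follows from a Bradley–Terry preference model combined with a KL-regularized objective. The exact IDPO objective is not stated in this summary, so the function name `dpo_pair_loss`, the default `beta`, and the scalar log-probability inputs are all assumptions made for illustration.

```python
import math


def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair of guidance.

    logp_w / logp_l: log-probabilities of the preferred and rejected
    guidance under the current controller (hypothetical inputs).
    ref_logp_w / ref_logp_l: the same quantities under the reference policy.
    beta: strength of the KL regularization (assumed value).
    """
    # Implicit reward margin between the preferred and rejected samples.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the controller equals the reference policy, the margin is zero.
loss_at_reference = dpo_pair_loss(-1.0, -2.0, -1.0, -2.0)
# Raising the preferred sample's likelihood lowers the loss.
loss_after_update = dpo_pair_loss(-0.5, -2.0, -1.0, -2.0)
```

In an iterative scheme, one plausible reading is that the reference log-probabilities are recomputed from the latest controller between rounds, so each round regularizes toward the previous iterate rather than the initial model.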

Abstract

Despite the impressive generative abilities of black-box large language models (LLMs), their inherent opacity hinders further advancements in capabilities such as reasoning, planning, and personalization. Existing works aim to enhance LLM capabilities via domain-specific adaptation, which requires additional training on accessible model parameters, an infeasible option for black-box LLMs. To address this challenge, we introduce Matryoshka Pilot (M-Pilot), a lightweight white-box LLM controller that guides a large-scale black-box LLM generator by decomposing complex tasks into a series of intermediate outputs. Specifically, we consider the black-box LLM as an environment, with M-Pilot serving as a policy that provides intermediate guidance through prompts to drive the black-box LLM. M-Pilot is trained to pivot the outputs of the black-box LLM toward alignment with preferences during iterative interaction, which enables controllable multi-turn generation and self-improvement in optimizing intermediate guidance. Empirical evaluations on diverse tasks demonstrate that our method effectively enhances the capabilities of black-box LLMs in complex, long-horizon tasks. Our code is publicly available at: https://github.com/lichangh20/Matryoshka.
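The controller-as-policy, generator-as-environment interaction described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_episode`, `toy_controller`, `toy_generator`, the prompt template, and the stopping rule are all hypothetical stand-ins, and a real setup would call a trainable white-box model and a black-box LLM API in their place.

```python
from typing import Callable, List, Tuple

# One (guidance, generator output) pair per interaction turn.
History = List[Tuple[str, str]]


def run_episode(
    query: str,
    controller: Callable[[str, History], str],
    generator: Callable[[str], str],
    is_done: Callable[[str], bool],
    max_turns: int = 3,
) -> Tuple[str, History]:
    """Drive a black-box generator with guidance from a controller."""
    history: History = []
    output = ""
    for _ in range(max_turns):
        # The controller (policy) proposes intermediate guidance.
        guidance = controller(query, history)
        # The generator (environment) responds to the guided prompt.
        output = generator(f"{query}\n\nGuidance: {guidance}")
        history.append((guidance, output))
        if is_done(output):
            break
    return output, history


# Toy stand-ins so the sketch runs without any model access.
def toy_controller(query: str, history: History) -> str:
    return f"step {len(history) + 1}: break the problem down"


def toy_generator(prompt: str) -> str:
    return "FINAL" if "step 2" in prompt else "partial"


answer, trace = run_episode(
    "What is 2 + 3 * 4?",
    toy_controller,
    toy_generator,
    is_done=lambda out: out == "FINAL",
)
```

Only the controller would be trained in such a loop; the generator is treated as a fixed callable, which is what allows a different black-box model to be swapped in.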

Paper Structure

This paper contains 60 sections, 15 equations, 7 figures, 22 tables.

Figures (7)

  • Figure 1: Enhancement of black-box LLM capabilities. Existing methods either (a) integrate well-crafted instructions or meticulously picked few-shot demonstrations as guidance or (b) exploit randomness in model generations to identify the most promising solution from candidates. In M-Pilot, we present (c) a controller-generator framework that enables white-box LLMs to drive the behavior of black-box LLMs for enhanced capabilities. Icons in the figure distinguish the trainable parameters from the inaccessible fixed parameters.
  • Figure 2: Controller-generator framework in M-Pilot comprising a white-box LLM as the controller and a black-box LLM as the generator and part of the environment. Given an input query $x$, M-Pilot leverages the intermediate generation $f_\theta(x)$ from the controller $\theta$ to drive the generator's behavior. The final answer is derived from the generation $y\sim g_{\text{LLM}}(f_\theta(x))$.
  • Figure 3: Examples of intermediate guidance generated by M-Pilot for complex reasoning, planning, and personalization tasks.
  • Figure 4: Overview of iterative guidance optimization. By iteratively updating both the model and the reference policy, M-Pilot progressively refines its intermediate guidance.
  • Figure 5: Success rate (%) w.r.t. the number of (a) training samples and (b) inner-loop interaction turns.
  • ...and 2 more figures