How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

Parth Asawa, Alan Zhu, Matei Zaharia, Alexandros G. Dimakis, Joseph E. Gonzalez

TL;DR

This work tackles the rigidity of static prompts for black-box foundation models by introducing Advisor Models: lightweight policies trained with reinforcement learning to emit per-instance natural language advice that is injected in-context to steer a frozen student model. The approach leverages GRPO to optimize advice based on task-specific rewards without requiring access to the student’s weights, enabling environment-specific memory and personalization. Across review-writing, math-solution generation, low-resource translation, and complex rule-following tasks, Advisor Models outperform static baselines, demonstrate transferability across different black-box models, and preserve the student’s general capabilities, addressing concerns about catastrophic forgetting. The results highlight the potential of dynamic, interpretable prompt policies to unlock environment-adaptive AI while maintaining robustness to out-of-distribution inputs and enabling practical deployment via API-accessible models.

Abstract

Foundation models are increasingly deployed as black-box services, where model weights cannot be modified and customization is limited to prompting. While static prompt optimization has shown promise, it produces a single fixed prompt that fails to adapt to different inputs, users, or environments. We introduce Advisor Models, lightweight parametric policies trained with reinforcement learning to reactively issue natural language steering instructions in-context to black-box models. The advisor is a second small model that sits between the input and the model, shaping behavior on a per-instance basis using reward signals from the environment. Across multiple domains involving reasoning and personalization, we show that Advisor Models outperform static prompt optimizers, discovering environment dynamics and improving downstream task performance. We also demonstrate the generalizability of advisors by transferring them across black-box models, as well as the framework's ability to achieve specialization while retaining robustness to out-of-distribution inputs. Viewed more broadly, Advisor Models provide a learnable interface to black-box systems where the advisor acts as a parametric, environment-specific memory. We argue that dynamic optimization of black-box models via Advisor Models is a promising direction for enabling personalization and environment-adaptable AI with frontier-level capabilities.
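
The inference-time flow described above (the advisor reads the input, emits advice, and the advice is injected in-context for the black-box model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt template, the function names (`build_student_prompt`, `advised_generate`), and the toy advisor and student callables are all assumptions made for the example. In practice the advisor would be a small trainable model and the student an API call to a black-box model.

```python
from typing import Callable, Tuple

# Hypothetical type alias: any function mapping a text prompt to a text reply.
TextModel = Callable[[str], str]


def build_student_prompt(task: str, advice: str) -> str:
    """Inject the advisor's advice in-context, ahead of the task.

    The template below is an assumption for illustration only.
    """
    return f"Advice:\n{advice}\n\nTask:\n{task}"


def advised_generate(task: str, advisor: TextModel, student: TextModel) -> Tuple[str, str]:
    """Run one advisor -> student pass and return (advice, final_output)."""
    advice = advisor(task)                       # per-instance advice
    prompt = build_student_prompt(task, advice)  # advice is injected in-context
    return advice, student(prompt)               # student weights are never touched


# Toy stand-ins so the sketch runs without any model or API access.
def toy_advisor(task: str) -> str:
    return "Keep the review under 50 words."


def toy_student(prompt: str) -> str:
    return "RESPONSE TO: " + prompt


if __name__ == "__main__":
    advice, output = advised_generate("Write a review of the hotel.", toy_advisor, toy_student)
    print(advice)
    print(output)
```

Because the student is only ever called through a text-in, text-out interface, the same advisor can in principle be paired with a different black-box model by swapping the `student` callable.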

Paper Structure

This paper contains 26 sections, 1 equation, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Example of the Advisor Models workflow. The input task is first given to the advisor model, which generates advice. In this example, the task is personalization and the advice is specific to the user (here, Matei). The input task and the advice are then passed to a strong black-box model (e.g., GPT-5) to generate the final output. The advisor learns the user's preferences via RL. Crucially, the advisor can be trained with RL using only API access to the black-box model.
  • Figure 2: Advisor Models combine open-source models with black-box models. Advisor Models generate instance-specific advice that is injected in-context to steer a black-box model. Rewards that the environment assigns to the final output are used in GRPO to reinforce effective advice.
  • Figure 3:
  • Figure 4: Strong initialization leads to faster learning. Advisor Models learning curve on the review length domain under strong and weak initialization, over 5 epochs. Generated advice improves in both settings, but training with strong initialization leads to better final performance.
  • Figure 5: Weak initialization can eventually learn. Advisor Models learning curve on the review length domain with weak initialization for 30 epochs, with the strong-initialization performance after 5 epochs provided as a reference. Performance eventually reaches the reference after extended training.
  • ...and 2 more figures
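
The Figure 2 caption mentions that environment rewards on the final output are used in GRPO to reinforce effective advice. As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several advice samples drawn for the same input against the group's mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions for this example; the paper's exact objective is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar per advice sample generated for the SAME input.
    Samples that beat the group average get a positive advantage, and samples
    below it get a negative one. `eps` (an assumed stabilizer) avoids division
    by zero when every sample receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four advice samples for one task; two led to rewarded student outputs.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When all samples in a group earn the same reward, every advantage is zero, so that group contributes no learning signal under this normalization.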