Guiding Large Language Models via Directional Stimulus Prompting

Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, Xifeng Yan

TL;DR

This work introduces Directional Stimulus Prompting (DSP), a framework that guides black-box LLMs by generating instance-specific directional prompts with a small, tunable policy model. DSP uses supervised fine-tuning and reinforcement learning (via NLPO, a variant of PPO) to optimize a directional stimulus that steers outputs without modifying the LLM's parameters. Across summarization, task-oriented dialogue, and chain-of-thought reasoning, DSP yields consistent gains, including substantial improvements on MultiWOZ with very limited data and notable improvements in CoT reasoning over human-crafted prompts. The approach offers data-efficient, interpretable, and flexible control of LLM outputs and opens avenues for discovering compact, task-specific “stimuli” that align LLM behavior with desired targets.

Abstract

We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs. Instead of directly adjusting LLMs, our method employs a small tunable policy model (e.g., T5) to generate an auxiliary directional stimulus prompt for each input instance. These directional stimulus prompts act as nuanced, instance-specific hints and clues to guide LLMs in generating desired outcomes, such as including specific keywords in the generated summary. Our approach sidesteps the challenges of direct LLM tuning by optimizing the policy model to explore directional stimulus prompts that align LLMs with desired behaviors. The policy model can be optimized through 1) supervised fine-tuning using labeled data and 2) reinforcement learning from offline or online rewards based on the LLM's output. We assess our method across summarization, dialogue response generation, and chain-of-thought reasoning tasks. Our experiments demonstrate that the framework consistently improves LLMs' (e.g., ChatGPT, Codex, InstructGPT) performance on these supervised tasks using minimal labeled data. Notably, using just 80 dialogues on the MultiWOZ dataset, our approach enhances ChatGPT's performance by an impressive 41.4%, matching or surpassing some fully supervised state-of-the-art models. Additionally, the instance-specific chain-of-thought prompt generated by our approach improves InstructGPT's reasoning accuracy compared to human-crafted or automatically generated prompts. The code and data are publicly available at https://github.com/Leezekun/Directional-Stimulus-Prompting.
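
The flow described in the abstract (a small policy model proposes a hint, the hint is placed in the prompt, and a frozen black-box LLM produces the output) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_policy` is a hypothetical stand-in for the tuned policy model (e.g., T5), `llm` is any callable that maps a prompt string to text, and the prompt wording is an assumption.

```python
from typing import Callable, List


def build_dsp_prompt(article: str, hints: List[str]) -> str:
    """Combine the input with a directional stimulus (here: keywords)."""
    hint_line = "; ".join(hints)
    return (
        f"Article: {article}\n"
        f"Keywords: {hint_line}\n"
        "Write a short summary of the article that incorporates the keywords above."
    )


def toy_policy(article: str, k: int = 3) -> List[str]:
    """Hypothetical stand-in for the policy model: pick the k longest distinct words."""
    words = {w.strip(".,").lower() for w in article.split()}
    words.discard("")
    # Sort by length (descending), then alphabetically, so the result is deterministic.
    return sorted(words, key=lambda w: (-len(w), w))[:k]


def dsp_generate(
    article: str,
    policy: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> str:
    """Run one DSP step: the policy proposes hints, the frozen LLM sees them in its prompt."""
    hints = policy(article)
    return llm(build_dsp_prompt(article, hints))


if __name__ == "__main__":
    # An "echo" LLM that returns its prompt, used only to inspect what the LLM would receive.
    print(dsp_generate("The council approved the stadium budget.", toy_policy, lambda p: p))
```

Only the policy is trainable in this arrangement; the LLM is reached solely through its prompt, which is why the approach applies to models whose parameters are not accessible.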

Paper Structure

This paper contains 30 sections, 7 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Comparison of our Directional Stimulus Prompting and the standard prompting method using LLMs such as ChatGPT for the summarization task. DSP utilizes directional stimulus/hints (highlighted in orange), which are keywords in this case, to provide instance-specific guidance to LLMs in generating summaries (highlighted in blue) that better align with the desired reference summary with higher ROUGE scores or other measures like human preferences.
  • Figure 2: Overview of our proposed framework DSP, where we learn a small tunable policy model to generate the directional stimulus (keywords in this case) that provide input-specific guidance for the LLM toward the desired target. The policy model can be trained with SFT and/or RL, where the reward is defined as the downstream task performance measure, such as the ROUGE score for the summarization task, or other alignment measures like human preferences.
  • Figure 3: Performance comparison of ChatGPT with standard prompting and DSP trained with SFT and SFT+RL, using varying numbers of training samples from the CNN/Daily Mail dataset.
  • Figure 4: Training curve on 1000 samples from the CNN/Daily Mail dataset.
  • Figure 5: Number of generated keywords, keyword precision, and summary ROUGE-1 during the training process on 4000 samples.
  • ...and 7 more figures
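
The captions for Figures 2 and 5 refer to a ROUGE-based reward and to keyword precision. The sketch below shows one simplified way such quantities could be computed. It is an assumption-laden illustration rather than the paper's evaluation code: `rouge1_f1` is a plain unigram-overlap F1 without stemming (not the official ROUGE toolkit), and `keyword_precision` assumes precision means the fraction of generated keywords that appear in the reference summary.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # multiset intersection clips repeated tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def keyword_precision(keywords: List[str], reference: str) -> float:
    """Assumed definition: share of generated keywords found in the reference summary."""
    if not keywords:
        return 0.0
    ref = reference.lower()
    hits = sum(1 for k in keywords if k.lower() in ref)
    return hits / len(keywords)


if __name__ == "__main__":
    reference = "The council approved the stadium budget"
    summary = "Council approved a new stadium budget"
    print(rouge1_f1(summary, reference))
    print(keyword_precision(["stadium", "budget", "mayor"], reference))
```

In an RL setup of the kind the Figure 2 caption describes, a score like `rouge1_f1` computed on the LLM's output would serve as the reward signal for updating the policy model.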