Prompt Tuning with Diffusion for Few-Shot Pre-trained Policy Generalization

Shengchao Hu, Wanru Zhao, Weixiong Lin, Li Shen, Ya Zhang, Dacheng Tao

TL;DR

This work reframes prompt-tuning as conditional generative modeling, in which prompts are generated from random noise, and proposes Prompt Diffuser, which uses a conditional diffusion model to generate high-quality prompts.

Abstract

Offline reinforcement learning (RL) methods harness previous experiences to derive an optimal policy, forming the foundation for pre-trained large-scale models (PLMs). When encountering tasks not seen before, PLMs often utilize several expert trajectories as prompts to expedite their adaptation to new requirements. Though a range of prompt-tuning methods have been proposed to enhance the quality of prompts, these methods often face optimization restrictions due to prompt initialization, which can significantly constrain the exploration domain and potentially lead to suboptimal solutions. To eliminate the reliance on the initial prompt, we shift our perspective towards the generative model, framing the prompt-tuning process as a form of conditional generative modeling, where prompts are generated from random noise. Our innovation, the Prompt Diffuser, leverages a conditional diffusion model to produce prompts of exceptional quality. Central to our framework is the approach to trajectory reconstruction and the meticulous integration of downstream task guidance during the training phase. Further experimental results underscore the potency of the Prompt Diffuser as a robust and effective tool for the prompt-tuning process, demonstrating strong performance in the meta-RL tasks.
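
To make "prompts are generated from random noise" concrete, here is a minimal sketch of standard DDPM-style ancestral sampling with a conditioning signal. It is not the authors' implementation: the linear noise schedule, the `toy_noise_predictor` stand-in for a trained network, the scalar condition (imagined as a return-to-go value), and the prompt shape of 5 transitions by 4 features are all assumptions made for illustration.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear variance schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x, step, condition):
    """Hypothetical stand-in for a trained network eps_theta(x, step, condition)."""
    return 0.1 * (x - condition)


def sample_prompt(noise_predictor, condition, shape, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise step by step into a prompt array."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)  # the prompt starts as pure noise
    for step in reversed(range(num_steps)):
        eps = noise_predictor(x, step, condition)
        # Posterior mean of the standard DDPM reverse step.
        mean = (x - betas[step] / np.sqrt(1.0 - alpha_bars[step]) * eps) / np.sqrt(alphas[step])
        # No noise is added at the final step.
        noise = rng.standard_normal(shape) if step > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[step]) * noise
    return x


# Assumed shape: 5 transitions, each flattened to 4 features.
prompt = sample_prompt(toy_noise_predictor, condition=1.0, shape=(5, 4))
```

In a real system the noise predictor would be a trained network and the sampled array would be decoded into the transitions that form the prompt; the sketch only shows the sampling loop.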

Paper Structure

This paper contains 32 sections, 1 theorem, 18 equations, 6 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Assume $\mathcal{L}_1$ and $\mathcal{L}_2$ are convex and differentiable. Suppose the gradient of $\mathcal{L}$ is $L$-Lipschitz with $L > 0$. Then, the gradient projection technique with step size $t \leq \frac{1}{L}$ will converge to either (1) a location in the optimization landscape where $\cos(
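
The theorem statement is truncated in this excerpt, so the following is only a sketch of one common form of gradient projection: the symmetric, PCGrad-style rule, in which a gradient that conflicts with the other (negative inner product) has its conflicting component removed. Whether the paper uses exactly this rule, and which of the two gradients it projects, is an assumption here; the function names and the quadratic example losses are hypothetical.

```python
import numpy as np


def project_gradient(g, g_other, eps=1e-12):
    """Remove from g its component along g_other when the two conflict."""
    dot = float(np.dot(g, g_other))
    if dot < 0.0:
        # Project g onto the plane orthogonal to g_other.
        g = g - dot / (float(np.dot(g_other, g_other)) + eps) * g_other
    return g


def projected_step(params, grad_l1, grad_l2, step_size):
    """One gradient step on L1 + L2 using mutually projected gradients."""
    g1 = grad_l1(params)
    g2 = grad_l2(params)
    return params - step_size * (project_gradient(g1, g2) + project_gradient(g2, g1))


# Hypothetical convex losses: L1 = 0.5 * ||x - a||^2 and L2 = 0.5 * ||x - b||^2.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
grad_l1 = lambda x: x - a
grad_l2 = lambda x: x - b

# The gradient of L1 + L2 is 2-Lipschitz, so step_size = 0.5 satisfies t <= 1/L.
new_params = projected_step(np.array([2.0, 2.0]), grad_l1, grad_l2, step_size=0.5)
```

When the two gradients do not conflict, the step reduces to plain gradient descent on the summed loss; the projection only changes the update when the inner product is negative.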

Figures (6)

  • Figure 1: Overall architecture of Prompt Diffuser. Diffuser samples transitions conditioned on the return-to-go and timestep tokens, which construct a prompt for the PLM. The loss between predicted and actual actions guides the denoising process, enhancing the quality of the generated prompts.
  • Figure 2: (a) The figure of the prompt updating process, where the prompt is treated as a point, and the updating process is simplified to identify the minimum value. (b) The performance of various methods (SP. refers to Soft Prompt) under different prompt initializations within the Cheetah-vel task.
  • Figure 3: Ablation on diffusion guidance. We establish the performance of Equation \ref{ab:dm} as the baseline and subsequently present the relative performance. Remarkably, our gradient projection technique consistently yields the most favorable outcomes.
  • Figure 4: The visual results of the reverse denoising process. The initial samples $x^0(\tau)$ originate from Gaussian noise, with a prompt length of 5, represented by distinct colors.
  • Figure 5: The t-SNE visualization of different prompts in the Ant-dir-OOD environment.
  • ...and 1 more figures

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Theorem 1