Self-Prompt Tuning: Enable Autonomous Role-Playing in LLMs

Aobo Kong; Shiwan Zhao; Hao Chen; Qicheng Li; Yong Qin; Ruiqi Sun; Xin Zhou; Jiaming Zhou; Haoqin Sun

TL;DR

This work addresses the challenge of manually crafting expert role prompts for LLMs by introducing self-prompt tuning, which enables autonomous role-playing through parameter updates rather than external prompts. The authors construct LIMA-Role by augmenting a small instruction-tuning corpus with GPT-4-generated role prompts, then fine-tune Mistral-7B and Llama-2-7B on it so that the models generate role prompts for new questions themselves. Evaluations across eight NLP benchmarks and an open-ended test show that self-prompt tuned models generally outperform standard instruction-tuned baselines, though they lag behind official or ChatGPT baselines on some open-ended tasks, a gap attributed to the limited scale of the training data. The authors release the LIMA-Role dataset, the fine-tuned models, and code to promote automation of prompting strategies and inspire future work on broader prompting techniques.

Abstract

Recent advancements in LLMs have showcased their remarkable role-playing capabilities: they can accurately simulate the dialogue styles and cognitive processes of various roles based on different instructions and contexts. Studies indicate that assigning LLMs the roles of experts, a strategy known as role-play prompting, can enhance their performance in the corresponding domains. However, the prompt needs to be manually designed for the given problem, which requires a certain level of expertise and iterative modification. To this end, we propose self-prompt tuning, which makes LLMs generate role-play prompts themselves through fine-tuning. Leveraging the LIMA dataset as our foundational corpus, we employ GPT-4 to annotate a role-play prompt for each data point, resulting in the LIMA-Role dataset. We then fine-tune LLMs such as Llama-2-7B and Mistral-7B on LIMA-Role. Consequently, the self-prompt tuned LLMs can automatically generate expert role prompts for any given question. We extensively evaluate self-prompt tuned LLMs on widely used NLP benchmarks and an open-ended question test. Our empirical results show that self-prompt tuned LLMs outperform standard instruction-tuned baselines across most datasets. This highlights the great potential of using fine-tuning to enable LLMs to self-prompt, thereby automating complex prompting strategies. We release the dataset, models, and code at https://anonymous.4open.science/r/Self-Prompt-Tuning-739E/.
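The data construction described above can be illustrated with a minimal sketch. It assumes that each LIMA-Role training target consists of the annotated role-play prompt followed by the original answer, so that a fine-tuned model learns to emit a role before answering. The exact layout used by the authors is not stated here, and the names `Example` and `to_self_prompt_example`, the record keys, and the sample question are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One question/answer pair from an instruction-tuning corpus."""
    question: str
    answer: str


def to_self_prompt_example(example: Example, role_prompt: str) -> dict:
    """Build one LIMA-Role-style training record (hypothetical layout)."""
    # Assumption: the role-play prompt is placed at the start of the target,
    # so the model is trained to produce a role first and then the answer.
    target = f"{role_prompt.strip()}\n\n{example.answer.strip()}"
    return {"prompt": example.question.strip(), "completion": target}


# Illustrative data only; not taken from LIMA or LIMA-Role.
example = Example(
    question="Why does ice float on water?",
    answer="Ice is less dense than liquid water, so it floats.",
)
role = "You are a physicist who specializes in thermodynamics."
record = to_self_prompt_example(example, role)
print(record["completion"])
```

Under this assumption no role prompt is supplied at inference time: the question alone is the input, and the role appears as the first part of the model's own output.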

Paper Structure (15 sections, 5 figures, 3 tables)

This paper contains 15 sections, 5 figures, 3 tables.

Introduction
Related Work
Instruction Tuning
Role-playing Abilities of LLMs
Prompting Strategies
Self-Prompt Tuning
Construct LIMA-Role Dataset
Fine-tune LLMs on LIMA-Role
Experiments
Tasks and Datasets
Experimental Setup
Results on NLP Benchmarks
Results on Open-ended Questions
Ablation Study
Conclusion

Figures (5)

Figure 1: Examples of a standard instruction-tuned LLM, an instruction-tuned LLM with manual role-play prompting, and a self-prompt tuned LLM on the same physics question. Manual and automatic role-play prompts are highlighted in gray and blue, respectively. The LLM used here is Mistral-7B.
Figure 2: An illustration of the LIMA-Role dataset construction process. The upper sub-image displays the prompt used for GPT-4 role-play prompt annotation. The lower sub-image shows how role-play prompts are used to construct LIMA-Role. The question to be annotated and the corresponding role-play prompt generated by GPT-4 are highlighted in gray and blue, respectively.
Figure 3: The performance comparison between Mistral-LIMA and Mistral-Role across various domain-specific subsets in MMLU. Mistral-Role outperforms Mistral-LIMA in 9 out of 10 domains and underperforms in chemistry.
Figure 4: Word clouds based on roles generated by Mistral-Role across domain-specific subsets in MMLU. Words characterized by larger font sizes and deeper color correspond to higher frequencies.
Figure 5: Preference evaluation on LIMA test set using GPT-4 as the annotator. In this context, LIMA refers to Mistral-LIMA, while Role denotes Mistral-Role.
