Crafting a Good Prompt or Providing Exemplary Dialogues? A Study of In-Context Learning for Persona-based Dialogue Generation

Jiashu Pu; Yajing Wan; Yuru Zhang; Jing Chen; Ling Cheng; Qian Shao; Yongzhu Chang; Tangjie Lv; Rongsheng Zhang

TL;DR

Problem: Can in-context learning improve persona-based dialogue generation with real human dialogues?

Approach: Systematically evaluate LLMs with varied prompts, demo retrieval methods, and demonstration corruption, using robust evaluation metrics including a learned Response Evaluator.

Contributions: Clear empirical guidance that prompt design is most cost-effective, random demo retrieval yields best results due to diversity, and models can learn from corrupted demos, challenging explanations based on simple n-gram induction.

Significance: Provides practical prescriptions for cost-efficient persona dialogue generation and offers insights into ICL mechanisms beyond token copying, with potential applicability across languages.

Abstract

Previous in-context learning (ICL) research has focused on tasks such as classification, machine translation, and text2table, while studies on whether ICL can improve human-like dialogue generation are scarce. Our work fills this gap by systematically investigating the ICL capabilities of large language models (LLMs) in persona-based dialogue generation, conducting extensive experiments on high-quality real human Chinese dialogue datasets. From the experimental results, we draw three conclusions: 1) adjusting prompt instructions is the most direct, effective, and economical way to improve generation quality; 2) randomly retrieving demonstrations (demos) achieves the best results, possibly due to their greater diversity and larger amount of effective information; counter-intuitively, retrieving demos with a context identical to the query performs the worst; 3) even when we destroy the multi-turn associations and single-turn semantics in the demos, increasing the number of demos still improves dialogue performance, showing that LLMs can learn from corrupted dialogue demos. Previous explanations of the ICL mechanism, such as $n$-gram induction heads, cannot fully account for this phenomenon.
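The two kinds of demo corruption mentioned in the abstract (breaking multi-turn associations and breaking single-turn semantics) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the representation of a demo as a list of turn strings, and the whitespace tokenization are all assumptions made for illustration. Chinese text in particular would need a proper tokenizer.

```python
import random

def shuffle_turns_across_demos(demos, rng):
    """Break multi-turn associations (hypothetical helper).

    All turns from all demos are pooled, shuffled, and dealt back so that
    each demo keeps its original number of turns but not its original flow.
    """
    pool = [turn for demo in demos for turn in demo]
    rng.shuffle(pool)
    corrupted, start = [], 0
    for demo in demos:
        corrupted.append(pool[start:start + len(demo)])
        start += len(demo)
    return corrupted

def shuffle_tokens_within_turns(demos, rng):
    """Break single-turn semantics (hypothetical helper).

    Tokens inside each turn are shuffled; whitespace splitting is an
    assumption and stands in for whatever tokenizer fits the data.
    """
    corrupted = []
    for demo in demos:
        new_demo = []
        for turn in demo:
            tokens = turn.split()
            rng.shuffle(tokens)
            new_demo.append(" ".join(tokens))
        corrupted.append(new_demo)
    return corrupted

# Toy demos, invented purely for illustration.
demos = [
    ["hi there friend", "hello how are you", "fine thanks a lot"],
    ["what do you do", "i teach math at school"],
]
rng = random.Random(0)  # fixed seed so the corruption is reproducible
turn_corrupted = shuffle_turns_across_demos(demos, rng)
token_corrupted = shuffle_tokens_within_turns(demos, rng)
```

Both helpers preserve the amount of text in each demo, so any change in generation quality under this sketch would come from the lost structure rather than from a shorter prompt.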

Paper Structure (29 sections, 14 figures, 13 tables)

Introduction
Problem Formulation
Evaluation Metrics for Generation
Different Prompt and ICL Settings
Experimental Settings
Evaluation LLMs & Dataset
Ablation Settings
Results Analysis
Connections between our experimental conclusions and previous work
Limitation
Appendix
More Details on Experimental Settings
Evaluation Metrics for Generation
Prompt Selection Process
Examples of Filled Templates
...and 14 more sections

Figures (14)

Figure 1: X-axis: value 0 represents the most similar condition, and value 4 represents the least similar condition (out of a total of 5 demos). Y-axis: the average distance between a demo's response and the response generated by the LLM under different similarity conditions, averaged across different persona settings and contexts. Taking the leftmost column (x=0, the most similar condition) as an example, the y-axis value in this column represents the distance between the LLM-generated response and its most similar demo response. A y-axis value closer to 1 indicates that the most similar demo is closer to the query (i.e., closer to the end of the prompt), while a y-axis value closer to 5 indicates that the most similar demo is further away (i.e., closer to the beginning of the prompt). Sub-figure in the lower right corner: the relationship between the demos' distance and their response similarity. The figure shows that, for all three demo retrieval methods, there is no consistent pattern in which closer demos have more similar responses. This result is not surprising for the Same and Random methods, as their demo orders are inherently random in $\mathbf{x}_{demo}$. For the Embedding method, the demos are sorted in ascending order of similarity between the demo context and the query context when constructing the prompt (the more similar to the query, the closer to the end of the prompt), but we did not find that similarity in context leads to similarity in response.
Figure 2: X-axis: length of the demonstration context. Y-axis: the proportion of LLM-generated tokens that come from the token set of the demonstrations $\mathbf{x}_{demo}$.
Figure 3: X-axis: length of the demonstration context. Y-axis: number of unique tokens in demonstrations' context for different methods.
Figure 4: The impact of label substitution and different semantic corruption methods on diversity, similarity, and response quality when the context length varies while the number of few-shot demonstrations remains fixed ($k=5$).
Figure 5: The performance comparison among the Context Only method, the Prompt Only method, and using both prompt and demonstrations when the context length varies while the number of few-shot demonstrations remains fixed ($k=5$).
...and 9 more figures
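
The quantities on the y-axes of Figures 2 and 3 can be sketched as follows. This is an illustrative reading of the captions, not the authors' code: the function names are hypothetical, inputs are assumed to be pre-tokenized lists of strings, and counting every generated token occurrence (rather than each distinct token once) is an assumption.

```python
def copied_token_ratio(generated_tokens, demo_tokens):
    """Share of generated tokens that also occur in the demo token set.

    Assumption: each generated token occurrence counts once; the caption of
    Figure 2 does not say whether occurrences or distinct tokens are counted.
    """
    if not generated_tokens:
        return 0.0  # avoid division by zero for an empty generation
    demo_vocab = set(demo_tokens)
    hits = sum(1 for tok in generated_tokens if tok in demo_vocab)
    return hits / len(generated_tokens)

def unique_token_count(demo_contexts):
    """Number of distinct tokens across all demo contexts (cf. Figure 3)."""
    return len({tok for context in demo_contexts for tok in context})

# Toy inputs, invented for illustration.
ratio = copied_token_ratio(["a", "b", "c", "a"], ["a", "x"])
n_unique = unique_token_count([["a", "b"], ["b", "c"]])
```

In this toy input, two of the four generated tokens appear in the demo token set, and the two contexts share one token, so the union holds three distinct tokens.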
