Investigating Content Planning for Navigating Trade-offs in Knowledge-Grounded Dialogue

Kushal Chawla, Hannah Rashkin, Gaurav Singh Tomar, David Reitter

TL;DR

This work investigates how explicit content planning can help balance attribution to evidence with conversational specificity in knowledge-grounded dialogue. It introduces PLEDGE (Plan-Edit-Generate), a framework with a planning-enabled generation model and a plan editor that refines plans toward quality estimators, exploring both structural and keyword plan formats. Experiments on Wizard of Wikipedia show that metric-aware planning improves automatic evaluation metrics but often degrades human judgments, highlighting a misalignment between automatic metrics and human perception. The study underscores the need for better-calibrated evaluation metrics and informs future research on planning strategies for grounded dialogue with careful consideration of human-centric evaluation.

Abstract

Knowledge-grounded dialogue generation is a challenging task because it requires satisfying two fundamental yet often competing constraints: being responsive in a manner that is specific to what the conversation partner has said while also being attributable to an underlying source document. In this work, we bring this trade-off between these two objectives (specificity and attribution) to light and ask the question: Can explicit content planning before the response generation help the model to address this challenge? To answer this question, we design a framework called PLEDGE, which allows us to experiment with various plan variables explored in prior work, supporting both metric-agnostic and metric-aware approaches. While content planning shows promise, our results on whether it can actually help to navigate this trade-off are mixed -- planning mechanisms that are metric-aware (use automatic metrics during training) are better at automatic evaluations but underperform in human judgment compared to metric-agnostic mechanisms. We discuss how this may be caused by over-fitting to automatic metrics and the need for future work to better calibrate these metrics towards human judgment. We hope the observations from our analysis will inform future work that aims to apply content planning in this context.

Paper Structure

This paper contains 32 sections, 7 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Knowledge-grounded responses need to optimize multiple qualities such as attribution to the evidence document or conversational specificity.
  • Figure 2: An intuitive overview of the methodology followed in this work to investigate content planning in knowledge-grounded dialogue. We explore plans that use structural variables and keywords.
  • Figure 3: Tradeoff between attribution and specificity scores: We experiment with masking over different portions of the input given to T5. By simply dropping portions of the evidence or the conversation history, the generated response increases along the specificity or attribution axes respectively, but at the expense of the other score. This shows that these metrics can be gamed when looking at either one in isolation from the other.
  • Figure 4: Plan-Edit-Generate framework (PLEDGE) -- a general-purpose methodology to analyze the benefits of diverse forms of content planning in knowledge-grounded dialogue. PLEDGE consists of two modules: the primary plan-based response generation model $G$ (Section \ref{sec:pledge:generation-model}) and a plan editing model $E_Q$ that learns to modify a given candidate plan so as to better satisfy the quality estimator $Q$. More details are in Section \ref{sec:methods:pledge} and Appendix \ref{sec:plan-editor-appendix}.
  • Figure 5: Harmonic mean of attribution and specificity scores increases as the plan is edited.
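
The captions for Figures 3 and 5 rely on combining attribution and specificity into one number through a harmonic mean. The minimal sketch below illustrates why that choice penalizes responses that score well on only one axis. It assumes both scores lie in [0, 1]; the function name and the example values are hypothetical and are not taken from the paper.

```python
def harmonic_mean(attribution: float, specificity: float) -> float:
    """Harmonic mean of two non-negative scores.

    Assumption: both scores are in [0, 1]. Returns 0.0 if either score
    is zero, since the harmonic mean is dominated by the smaller value.
    """
    if attribution < 0 or specificity < 0:
        raise ValueError("scores must be non-negative")
    if attribution == 0 or specificity == 0:
        return 0.0
    return 2 * attribution * specificity / (attribution + specificity)


# Hypothetical scores: a response that ignores the conversation history
# may be highly attributable yet barely specific.
lopsided = harmonic_mean(0.9, 0.1)
# Hypothetical scores: a response that is moderately good on both axes.
balanced = harmonic_mean(0.5, 0.5)

print(f"lopsided: {lopsided:.2f}, balanced: {balanced:.2f}")
```

With these illustrative inputs the balanced response scores higher than the lopsided one, even though both pairs have the same arithmetic mean. This is the sense in which a harmonic mean discourages gaming either metric in isolation.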