Grounding Gaps in Language Model Generations
Omar Shaikh, Kristina Gligorić, Ashna Khetan, Matthias Gerstgrasser, Diyi Yang, Dan Jurafsky
TL;DR
Grounding is essential for effective dialogue, but LLMs often fail to generate human-like grounding acts, risking miscommunication in sensitive domains. The authors curate a set of grounding acts, build a labeling classifier, and simulate turn-taking to quantify a grounding gap between humans and LLMs across emotional support, education, and persuasion datasets. They find LLMs substantially underuse grounding acts and show weak agreement with human grounding, with instruction tuning and preference optimization further reducing grounding acts. Prompting mitigations increase act frequency but do not improve alignment, suggesting that addressing grounding requires training data and objectives that explicitly encode multi-turn grounding strategies. Overall, the work highlights a crucial direction for safer and more effective human-AI dialogue.
Abstract
Effective conversation requires common ground: a shared understanding between the participants. Common ground, however, does not emerge spontaneously in conversation. Speakers and listeners work together to both identify and construct a shared basis while avoiding misunderstanding. To accomplish grounding, humans rely on a range of dialogue acts, like clarification (What do you mean?) and acknowledgment (I understand.). However, it is unclear whether large language models (LLMs) generate text that reflects human grounding. To this end, we curate a set of grounding acts and propose corresponding metrics that quantify attempted grounding. We study whether LLM generations contain grounding acts, simulating turn-taking from several dialogue datasets and comparing results to humans. We find that, compared to humans, LLMs generate language with less conversational grounding, instead generating text that appears to simply presume common ground. To understand the roots of the identified grounding gap, we examine the role of instruction tuning and preference optimization, finding that training on contemporary preference data leads to a reduction in generated grounding acts. Altogether, we highlight the need for more research investigating conversational grounding in human-AI interaction.
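As a rough illustration of what quantifying attempted grounding could look like, the sketch below compares how often a grounding act appears in human turns versus LLM turns simulated at the same positions, and how well the two agree beyond chance. This is a minimal sketch under stated assumptions, not the authors' metric: the per-turn labels are invented, the function names `act_rate` and `cohen_kappa` are hypothetical, and Cohen's kappa stands in for whatever agreement measure the paper actually uses.

```python
from typing import Sequence


def act_rate(labels: Sequence[bool]) -> float:
    """Fraction of turns labeled as containing the grounding act."""
    if not labels:
        raise ValueError("need at least one labeled turn")
    return sum(labels) / len(labels)


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two aligned binary label sequences."""
    if len(a) != len(b):
        raise ValueError("label sequences must be aligned turn by turn")
    n = len(a)
    if n == 0:
        raise ValueError("need at least one labeled turn")
    # Observed agreement: share of turns where both labels match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if the two label sources were independent.
    pa, pb = act_rate(a), act_rate(b)
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        # Kappa is undefined when both sequences are constant and identical;
        # returning 0.0 here is a choice made for this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


# Invented labels: does the turn at this position contain, say, a clarification?
human_turns = [True, False, True, True, False, False, True, False]
llm_turns = [False, False, True, False, False, False, False, False]

human_rate = act_rate(human_turns)
llm_rate = act_rate(llm_turns)
agreement = cohen_kappa(human_turns, llm_turns)

print(f"human rate={human_rate}, LLM rate={llm_rate}, kappa={agreement}")
```

Reporting the rate and the agreement separately matters because they can move independently: a model could match the human rate of clarification while clarifying at entirely different turns.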
