A Benchmark to Assess Common Ground in Human-AI Collaboration

Christian Poelitz; Finale Doshi-Velez; Siân Lindley

TL;DR

We introduce a new benchmark, grounded in theories and empirical studies of human-human collaboration, built on a collaborative puzzle task that requires iterative interaction, joint action, referential coordination, and repair under varying conditions of situation awareness.

Abstract

AI is becoming increasingly integrated into everyday life, both in professional work environments and in leisure and entertainment contexts. This integration requires AI to move beyond acting as an assistant for informational or transactional tasks toward a genuine collaborative partner. Effective collaboration, whether between humans or between humans and AI, depends on establishing and maintaining common ground: shared beliefs, assumptions, goals, and situational awareness that enable coordinated action and efficient repair of misunderstandings. While common ground is a central concept in human collaboration, it has received limited attention in studies of human-AI collaboration. In this paper, we introduce a new benchmark grounded in theories and empirical studies of human-human collaboration. The benchmark is based on a collaborative puzzle task that requires iterative interaction, joint action, referential coordination, and repair under varying conditions of situation awareness. We validate the benchmark through a confirmatory user study in which human participants collaborate with an AI to solve the task. The results show that the benchmark reproduces established theoretical and empirical findings from human-human collaboration, while also revealing clear divergences in human-AI interaction.

Paper Structure (25 sections, 38 figures, 3 tables)

Introduction
Background and related work
Human-AI collaboration
Common ground
Communication Medium and Common Ground
Common Ground in human-AI interaction
Benchmark for common ground
Benchmark design
Study - Benchmark validation
Design
Participants
Materials
Procedure
Data Analysis
Comparisons to expected observations on task performance (Task-level)
...and 10 more sections

Figures (38)

Figure 1: Illustration of human–AI interaction tasks arranged by increasing need for common ground. Left: Typical use cases of AI assistants, where the user specifies a task and the system produces a one-shot response. The input largely determines the desired output, ambiguity is minimal, and little or no mutual understanding is required; when misalignment occurs, the burden of repair falls primarily on the human. Right: Use cases in which AI acts as a genuine collaborative partner. Human and AI engage in iterative dialogue, jointly presenting, clarifying, repairing, and accepting contributions while actively building and maintaining common ground. Such tasks require goals and scope to be negotiated, assumptions to be updated, and hypotheses to be jointly constructed and revised over time. Without mutual adaptation and shared understanding, a simple assistant cannot effectively support these interactions; misalignment accumulates and the human bears the full burden of repair [Clark1991GroundingIC].
Figure 2: Design of the puzzle task. Left: the Worker's view while solving the current puzzle, with the working area showing current progress, the available puzzle pieces, and the chat window. Right: the Helper's view, with the Worker's current progress (in the shared-view condition), the target puzzle, and the chat window. In the non-shared view condition, the Worker's current progress is removed from the Helper's view.
Figure 3: Target puzzle solutions across trials 1-4 (a–d). The initial practice puzzle is shown in Fig. 2.
Figure 5: Exact matches of the final puzzle configuration with the target solution, by condition and trial. Left: the shared view condition. Right: the non-shared view condition.
Figure 6: Total number of words used in the messages averaged across roles and across all puzzle trials in each condition (excluding the practice trial). Left: the shared view condition. Right: the non-shared view condition.
...and 33 more figures
