Gesturing Toward Abstraction: Multimodal Convention Formation in Collaborative Physical Tasks

Kiyosu Maeda; William P. McCarthy; Ching-Yi Tsai; Jeffrey Mu; Haoliang Wang; Robert D. Hawkins; Judith E. Fan; Parastoo Abtahi

TL;DR

This work investigates how people form multimodal conventions in iterative physical collaboration by combining a large online unimodal study with an AR-based laboratory study and an extended probabilistic model. The online study reveals rapid abstraction from block-level to tower-level language, while the physical study shows linguistic and gestural conventions emerging concurrently, with cross-modal redundancy used to emphasize changes. The authors extend the Rational Speech Act (RSA) framework with a multimodal lexicon and a domain-specific language to simulate how abstractions and modality preferences evolve across repetitions, and they show that the model reproduces the shortening of instructions and the diverging modality strategies observed in participants. The findings provide building blocks for convention-aware intelligent agents that learn user-specific multimodal conventions and adapt to shifts in communication costs and preferences, which could enable more efficient human–robot collaboration in real-world assembly tasks.
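
To make the modeling idea above more concrete, here is a minimal sketch of a standard Rational Speech Act speaker choosing between spoken and gestural utterances. It is not the authors' model: the lexicon values, the utterance labels, the cost numbers, and the function names (`literal_listener`, `speaker`) are all assumptions made for illustration, and the paper's domain-specific language and convention updating are left out.

```python
import math

# Toy lexicon (hypothetical values, not taken from the paper):
# how well each utterance literally fits each target tower.
LEXICON = {
    "speech:'L'":    {"L_tower": 1.0, "C_tower": 0.0},
    "speech:'C'":    {"L_tower": 0.0, "C_tower": 1.0},
    "gesture:trace": {"L_tower": 0.7, "C_tower": 0.3},
}
MEANINGS = ["L_tower", "C_tower"]


def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def literal_listener(utterance):
    """L0(m | u) is proportional to lexicon[u][m], with a uniform prior."""
    return normalize({m: LEXICON[utterance][m] for m in MEANINGS})


def speaker(meaning, cost, alpha=1.0):
    """S1(u | m) is proportional to exp(alpha * (log L0(m | u) - cost[u]))."""
    scores = {}
    for u in LEXICON:
        p = literal_listener(u)[meaning]
        scores[u] = math.exp(alpha * (math.log(p) - cost[u])) if p > 0 else 0.0
    return normalize(scores)


# Assumed per-utterance production costs; lowering the gesture cost
# stands in for a speaker who comes to prefer gesturing.
early = {"speech:'L'": 0.2, "speech:'C'": 0.2, "gesture:trace": 1.0}
late = {"speech:'L'": 0.2, "speech:'C'": 0.2, "gesture:trace": 0.1}

p_early = speaker("L_tower", early)
p_late = speaker("L_tower", late)
```

Comparing `p_early` and `p_late` shows the gesture's share of the speaker distribution rising as its assumed cost falls, which is one simple way a shift in modality preference can be expressed in this kind of model.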

Abstract

A quintessential feature of human intelligence is the ability to create ad hoc conventions over time to achieve shared goals efficiently. We investigate how communication strategies evolve through repeated collaboration as people coordinate on shared procedural abstractions. To this end, we conducted an online unimodal study (n = 98) using natural language to probe abstraction hierarchies. In a follow-up lab study (n = 40), we examined how multimodal communication (speech and gestures) changed during physical collaboration. Pairs used augmented reality to isolate their partner's hand and voice; one participant viewed a 3D virtual tower and sent instructions to the other, who built the physical tower. Participants became faster and more accurate by establishing linguistic and gestural abstractions and using cross-modal redundancy to emphasize key changes from previous interactions. Based on these findings, we extend probabilistic models of convention formation to multimodal settings, capturing shifts in modality preferences. Our findings and model provide building blocks for designing convention-aware intelligent agents situated in the physical world.

Paper Structure (56 sections, 8 equations, 15 figures, 2 tables)

Introduction
Related Work
Gestures in Thought and Communication
Multimodal Cues in Collaborative Tasks
Conventions in Repeated Collaboration
Online Unimodal Study
Method
Participants
Stimuli
Results
Reconstruction accuracy improves across repetitions
Communicative efficiency improves across repetitions
Level of referential abstraction increases across repetitions
Physical Multimodal Study
Method
...and 41 more sections

Figures (15)

Figure 1: (A) Instructor viewed a target scene and gave assembly instructions to the Builder. (B) Scenes with two towers.
Figure 2: (A) Mean reconstruction accuracy improved across repetitions. (B) Mean instruction length per trial decreased across repetitions as dyads became more effective at collaborating. (C) Words with the largest positive or negative changes in frequency from R1 to R4. (D) Change in the number of block- and tower-level references. Dashed lines are the maximum possible number of blocks and towers. (E) The proportion of expressions exclusively referring to blocks or towers. Error bars: 95% CIs.
Figure 3: Example messages showing the emergence of tower-level expressions: upside down U, long C, and long L.
Figure 4: Physical LEGO blocks: blue (x-axis), red (y-axis), green (z-axis); and three-block towers: TREE (unknown mapping, 3D), L (alphabetic, 2D x-y), and C (alphabetic, 2D z-y).
Figure 5: (A) The Instructor recorded a multimodal message (speech and gesture) in augmented reality (AR) describing how to build a virtual target tower with a specific pose. (B) The Builder replayed the message (audio and overlaid AR hands) and built the tower with physical blocks.
...and 10 more figures
