SPACER: A Parallel Dataset of Speech Production And Comprehension of Error Repairs
Shiva Upadhye, Jiaxuan Li, Richard Futrell
TL;DR
SPACER is a parallel dataset that links naturalistic speech errors and speakers' self-repairs in production with comprehenders' corrections of the same errors, enabling integrated study of error monitoring across production and comprehension. It combines single-word substitution errors drawn from the Switchboard corpus with a web-based correction task, yielding 1056 initial utterances (576 SC, 480 SU) and 5808 comprehender responses from 66 participants. Exploratory analyses suggest asymmetries: speakers tend to self-repair errors that introduce large semantic and phonemic deviations, whereas comprehenders tend to correct errors that are phonemically similar to more plausible alternatives or poorly supported by the prior context, suggesting complementary strategies. The dataset supports computational investigation of rational-inference models of error correction and offers a resource for bridging production and comprehension research.
Abstract
Speech errors are a natural part of communication, yet they rarely lead to complete communicative failure because both speakers and comprehenders can detect and correct errors. Although prior research has examined error monitoring and correction in production and comprehension separately, integrated investigation of both systems has been impeded by the scarcity of parallel data. In this study, we present SPACER, a parallel dataset that captures how naturalistic speech errors are corrected by both speakers and comprehenders. We focus on single-word substitution errors extracted from the Switchboard corpus, accompanied by speakers' self-repairs and comprehenders' responses from an offline text-editing experiment. Our exploratory analysis suggests asymmetries in error correction strategies: speakers are more likely to repair errors that introduce greater semantic and phonemic deviations, whereas comprehenders tend to correct errors that are phonemically similar to more plausible alternatives or that do not fit the prior context. Our dataset enables future research on integrated approaches to studying language production and comprehension.
