A Collection of Pragmatic-Similarity Judgments over Spoken Dialog Utterances

Nigel G. Ward, Divette Marco

TL;DR

Spoken dialog research lacks reliable resources for evaluating pragmatic similarity between utterances. The paper introduces PragSim, the first dataset of human pragmatic-similarity judgments. It consists of seed/re-enactment utterance pairs in English and Spanish, produced with six re-enactment methods and rated on a continuous scale by multiple judges. The average inter-judge correlation reaches 0.72 for English and 0.66 for Spanish, and ratings are influenced by factors such as judge identity, experience, lexical content, and duration differences. The publicly available dataset enables training and evaluating pragmatic-similarity metrics, with applications in dialog systems, speech synthesis, machine translation, and language-learning assessment.

Abstract

Automatic measures of similarity between utterances are invaluable for training speech synthesizers, evaluating machine translation, and assessing learner productions. While there exist measures for semantic similarity and prosodic similarity, there are as yet none for pragmatic similarity. To enable the training of such measures, we developed the first collection of human judgments of pragmatic similarity between utterance pairs. Each pair consists of an utterance extracted from a recorded dialog and a re-enactment of that utterance. Re-enactments were done under various conditions designed to create a variety of degrees of similarity. Each pair was rated on a continuous scale by 6 to 9 judges. The average inter-judge correlation was as high as 0.72 for English and 0.66 for Spanish. We make this data available at https://github.com/divettemarco/PragSim.
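
To make the agreement figure concrete, the sketch below shows one common way to compute an average inter-judge correlation: take the Pearson correlation for every pair of judges and average the results. This is an illustration under assumptions, not necessarily the authors' procedure. The abstract does not state which correlation coefficient or averaging scheme was used, and the function names, judge labels, and rating values here are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_interjudge_correlation(ratings_by_judge):
    """Average the Pearson correlation over all pairs of judges.

    ratings_by_judge maps a judge label to that judge's ratings,
    listed in the same stimulus order for every judge.
    (Assumption: every judge rated every stimulus.)
    """
    judge_pairs = list(combinations(ratings_by_judge.values(), 2))
    return sum(pearson(x, y) for x, y in judge_pairs) / len(judge_pairs)


# Made-up ratings for five utterance pairs; these are NOT PragSim data,
# and the rating scale is chosen arbitrarily for the example.
toy_ratings = {
    "judge_a": [4.5, 2.0, 3.5, 1.0, 5.0],
    "judge_b": [4.0, 2.5, 3.0, 1.5, 4.5],
    "judge_c": [5.0, 1.0, 4.0, 2.0, 4.0],
}
print(round(mean_interjudge_correlation(toy_ratings), 2))
```

If some judges skipped some stimuli, each pairwise correlation would have to be restricted to the stimuli both judges rated; the sketch leaves that case out.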

Paper Structure (20 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Rating Instructions
  • Figure 2: Distributions of judgments for each judge in Session 1.
  • Figure 3: Distributions of the mean per-stimulus ratings for each re-enactment method in Session 1.