Dia-Lingle: A Gamified Interface for Dialectal Data Collection

Jiugeng Sun, Rita Sevastjanova, Sina Ahmadi, Rico Sennrich, Mennatallah El-Assady

TL;DR

The paper tackles the scarcity of dialectal data in NLP by introducing Dia-Lingle, a gamified interface that combines two data-collection components (Quiz and Match) with an uncertainty-driven active-learning loop to expand dialect corpora. It documents a dialect classifier, a hexagon-based geographic visualization, and a multi-layer interface designed to sustain user engagement through progressive difficulty. The key contributions are (i) a gamified data-collection approach for dialectal resources, (ii) integration of active learning to guide sentence selection, and (iii) a visualization-centric method for representing dialect-region coverage, validated by usability studies showing high user satisfaction. By enabling community participation and providing a scalable, interpretable data-collection workflow, Dia-Lingle supports more inclusive and dialect-aware NLP technologies and future dialect-specific language modeling.

Abstract

Dialects suffer from the scarcity of computational textual resources as they exist predominantly in spoken rather than written form and exhibit remarkable geographical diversity. Collecting dialect data and subsequently integrating it into current language technologies present significant obstacles. Gamification has been proven to facilitate remote data collection processes with great ease and on a substantially wider scale. This paper introduces Dia-Lingle, a gamified interface that aims to improve and facilitate dialectal data collection tasks such as corpus expansion and dialect labelling. The platform features two key components: the first challenges users to rewrite sentences in their dialects, identifies the dialect with a classifier, and solicits feedback; the second asks users to match sentences to their geographical locations. Dia-Lingle combines active learning with gamified difficulty levels, strategically encouraging prolonged user engagement while efficiently enriching the dialect corpus. A usability evaluation shows high levels of user satisfaction with our interface. Dia-Lingle is available at https://dia-lingle.ivia.ch/, and a demo video at https://youtu.be/0QyJsB8ym64.
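
The uncertainty-driven sentence selection mentioned in the TL;DR can be illustrated with a minimal sketch. This summary does not state which uncertainty measure Dia-Lingle uses, so the sketch assumes Shannon entropy over the classifier's predicted dialect probabilities. The function names, the example sentences, and the probability values are hypothetical and are not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(candidates, k=1):
    """Return the k sentences whose predicted dialect distribution
    has the highest entropy, i.e. where the classifier is least sure.

    candidates: list of (sentence, probs) pairs.
    """
    ranked = sorted(candidates, key=lambda c: entropy(c[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:k]]

# Hypothetical classifier outputs over three dialect regions (assumed values).
candidates = [
    ("sentence A", [0.90, 0.05, 0.05]),
    ("sentence B", [0.40, 0.35, 0.25]),
    ("sentence C", [0.60, 0.30, 0.10]),
]

print(select_most_uncertain(candidates, k=2))
```

Under this assumption, sentences with flatter predicted distributions would be shown to users first, since their labels are likely to be the most informative additions to the corpus.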

Paper Structure

This paper contains 24 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Illustrative workflow of Dia-Lingle with colour-encoded components for clarity.
  • Figure 2: Illustration of dialect representation in Dia-Lingle, using Dialect X, spoken primarily in the Graubünden region of Switzerland, as an example.
  • Figure 3: Illustration of a parallel sentence group in Swiss German. There is one standardised sentence and multiple dialect sentences that convey the exact same meaning.
  • Figure 4: Simplified overview of the Dia-Lingle interface design, detailed in the interface design section. Major components are enlarged for visibility and labelled with circled numbers for reference. The appendix provides additional images of other stages.
  • Figure 5: Illustration of the two gamified components (Quiz and Match) as they appear on the Choice page.
  • ...and 2 more figures