MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank

Verena Blaschke, Barbara Kovačić, Siyao Peng, Hinrich Schütze, Barbara Plank

TL;DR

This paper presents MaiBaam, the first multi-dialect Bavarian treebank, manually annotated with part-of-speech and syntactic dependency information in UD. The treebank covers multiple text genres, and the paper highlights the morphosyntactic differences between the closely related Bavarian and German.

Abstract

Despite the success of the Universal Dependencies (UD) project, exemplified by its impressive language breadth, there is still a lack of 'within-language breadth': most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, so far no treebank exists for one of its language varieties spoken by over 10M people: Bavarian. To contribute to closing this gap, we present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in UD, covering multiple text genres (wiki, fiction, grammar examples, social, non-fiction). We highlight the morphosyntactic differences between the closely related Bavarian and German and showcase the rich variability of speakers' orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries. We provide baseline parsing and POS tagging results, which are lower than results obtained on German and vary substantially between different graph-based parsers. To support further research on Bavarian syntax, we make our dataset, language-specific guidelines and code publicly available.

Paper Structure (48 sections, 2 figures, 7 tables)

Figures (2)

  • Figure 1: Bavarian dialect groups in Germany, Austria and Italy, based on the classification by Wiesinger (1983). Names of dialect groups are in small caps, names of provinces and states in italics.
  • Figure 2: Gold-standard (top) and predicted (bottom) annotations. Predictions are produced by the UDPipe model trained on GSD, the best system in our evaluation. Wrong predictions are in red. 'The Lammer (river) has fairly clean water and is also pretty popular for whitewater sports.' (Wiki Låmma 'Lammer')