Evaluating Pixel Language Models on Non-Standardized Languages

Alberto Muñoz-Ortiz, Verena Blaschke, Barbara Plank

TL;DR

This study investigates pixel-based language modeling as a solution to dialectal variability that hampers token-based NLP. By rendering text as 16x16 patches processed by a Vision Transformer, the PIXEL approach avoids vocabulary expansion and shows robust zero-shot transfer from Standard German to several dialects, particularly in syntactic tasks and intent detection. Across four downstream tasks, PIXEL generally surpasses token-based BERT on dialects for POS tagging, dependency parsing, and some semantic tasks, yet underperforms in topic classification and standard German settings. The findings highlight the potential of patch-based, language-agnostic representations for handling orthographic noise and non-standard varieties, with implications for low-resource and multilingual NLP, albeit with computational and data-availability considerations.

Abstract

We explore the potential of pixel-based models for transfer learning from standard languages to dialects. These models convert text into images that are divided into patches, enabling a continuous vocabulary representation that proves especially useful for out-of-vocabulary words common in dialectal data. Using German as a case study, we compare the performance of pixel-based models to token-based models across various syntactic and semantic tasks. Our results show that pixel-based models outperform token-based models in part-of-speech tagging, dependency parsing and intent detection for zero-shot dialect evaluation by up to 26 percentage points in some scenarios, though not in Standard German. However, pixel-based models fall short in topic classification. These findings emphasize the potential of pixel-based models for handling dialectal data, though further research should be conducted to assess their effectiveness in various linguistic contexts.
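To make the patching step concrete, the following is a minimal sketch of how a rendered text strip could be cut into 16x16 patches. It is not the PIXEL implementation: the function name `split_into_patches`, the list-of-rows image representation, and the white right-padding are all assumptions made for illustration, and the text rendering itself is replaced by a synthetic strip of black pixels.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values (0 = black, 255 = white)

PATCH_SIZE = 16  # patch side length, as stated in the TL;DR


def split_into_patches(image: Image, patch_size: int = PATCH_SIZE) -> List[Image]:
    """Split a rendered text strip into non-overlapping square patches.

    Hypothetical helper: the strip is padded on the right with white pixels
    so that its width becomes a multiple of patch_size. The padding colour
    is an assumption, not something the source specifies.
    """
    height = len(image)
    if height != patch_size:
        raise ValueError(f"expected a strip of height {patch_size}, got {height}")
    width = len(image[0])
    if any(len(row) != width for row in image):
        raise ValueError("all rows must have the same width")

    # Round the width up to the next multiple of patch_size.
    padded_width = -(-width // patch_size) * patch_size
    padded = [row + [255] * (padded_width - width) for row in image]

    # Each patch takes the same column window from every row.
    return [
        [row[start:start + patch_size] for row in padded]
        for start in range(0, padded_width, patch_size)
    ]


# Stand-in for a rendered sentence: a 16 x 40 strip of black pixels.
strip = [[0] * 40 for _ in range(16)]
patches = split_into_patches(strip)
```

Because every string maps to some sequence of patches, an unseen dialect spelling never falls outside the input space, which is the property the abstract refers to as a continuous vocabulary representation.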

Paper Structure

This paper contains 19 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: "Welcome!" in Standard German (a) and the Swiss German Bern dialect (b), tokenized using DBMDZ German BERT and rendered and split into patches by PIXEL. Standard German is tokenized in a more meaningful way, whereas the Bernese dialect form results in multiple non-meaningful sub-tokens due to variations in spelling.
  • Figure 2: Labelled attachment scores (in %) for different dependency distances for models trained on German GSD.
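
The metric in Figure 2 can be illustrated with a short sketch. The labelled attachment score (LAS) is the percentage of tokens whose predicted head and dependency label both match the gold annotation. Grouping by the distance between a token and its gold head, and placing root tokens in bucket 0, are assumptions here; this section does not state how the paper buckets distances. The function name `las_by_distance` and the toy sentence are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (token index, gold head, gold label, predicted head, predicted label)
# Indices are 1-based; a head of 0 marks the root.
Token = Tuple[int, int, str, int, str]


def las_by_distance(tokens: List[Token]) -> Dict[int, float]:
    """Return LAS (in %) per gold dependency distance.

    Assumption: distance is |gold head - token index|, and root tokens
    are collected under distance 0.
    """
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for idx, gold_head, gold_label, pred_head, pred_label in tokens:
        distance = 0 if gold_head == 0 else abs(gold_head - idx)
        total[distance] += 1
        # A token counts only if both head and label are right.
        if gold_head == pred_head and gold_label == pred_label:
            correct[distance] += 1
    return {d: 100.0 * correct[d] / total[d] for d in sorted(total)}


# Toy example, "Ich sehe den Hund": the parser gets every head right
# but mislabels the object relation on token 4.
sentence = [
    (1, 2, "nsubj", 2, "nsubj"),
    (2, 0, "root", 0, "root"),
    (3, 4, "det", 4, "det"),
    (4, 2, "obj", 2, "nmod"),
]
scores = las_by_distance(sentence)
```

In this toy case the only error falls on the single token at distance 2, so that bucket scores 0% while the others score 100%.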