Data-Augmentation-Based Dialectal Adaptation for LLMs

Fahim Faisal, Antonios Anastasopoulos

TL;DR

This work tackles dialectal commonsense reasoning for South Slavic varieties via the DIALECT-COPA task. It proposes a data augmentation framework that combines an encoder-based model with a multilingual LLM and uses transliteration, synthetic data, and cross-lingual data mixing to overcome training data scarcity, particularly for Chakavian. Through a phased experimental design, the authors show substantial gains across Chakavian, Cherkano, and Torlak: the open-source Aya-101 and BERTić models benefit notably from augmentation and instruction tuning, while large closed models such as GPT-4 lead overall on the test set. The study highlights the practical value of augmentation for low-resource dialects and points to promising directions in data-driven dialect adaptation and efficient tuning for non-standard varieties.

Abstract

This report presents GMUNLP's participation in the Dialect-Copa shared task at VarDial 2024, which focuses on evaluating the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects. The task aims to assess how well LLMs can handle non-standard dialectal varieties, as their performance on standard languages is already well established. We propose an approach that combines the strengths of different types of language models and leverages data augmentation techniques to improve task performance on three South Slavic dialects: Chakavian, Cherkano, and Torlak. We conduct experiments using a language-family-focused encoder-based model (BERTić) and a domain-agnostic multilingual model (Aya-101). Our results demonstrate that the proposed data augmentation techniques lead to substantial performance gains across all three test datasets in the open-source model category. This work highlights the practical utility of data augmentation and the potential of LLMs in handling non-standard dialectal varieties, contributing to the broader goal of advancing natural language understanding in low-resource and dialectal settings. Code: https://github.com/ffaisal93/dialect_copa
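
To make the transliteration-based augmentation mentioned in the TL;DR more concrete, the following is a minimal sketch and not the authors' implementation. It assumes COPA-style examples stored as dictionaries with `premise`, `choice1`, `choice2`, `question`, and `label` fields, and it assumes the augmentation direction is Serbian Cyrillic to Latin script. The function names `cyrillic_to_latin` and `augment_with_transliteration` are hypothetical.

```python
# Serbian Cyrillic -> Latin mapping (lowercase only; case is restored below).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic characters; leave everything else as-is."""
    out = []
    for ch in text:
        latin = CYR_TO_LAT.get(ch.lower())
        if latin is None:
            out.append(ch)  # punctuation, digits, Latin letters
        elif ch.isupper():
            out.append(latin.capitalize())  # e.g. "Љ" -> "Lj"
        else:
            out.append(latin)
    return "".join(out)


def augment_with_transliteration(examples: list[dict]) -> list[dict]:
    """Return the original examples plus a Latin-script copy of each
    example that contained Cyrillic text (assumed COPA-style dicts)."""
    augmented = list(examples)
    for ex in examples:
        copy = {
            key: cyrillic_to_latin(value) if isinstance(value, str) else value
            for key, value in ex.items()
        }
        if copy != ex:  # skip examples that were already Latin-only
            augmented.append(copy)
    return augmented
```

Keeping the original examples alongside their transliterated copies doubles the training signal for Cyrillic-script data without changing labels; whether this matches the exact mixing strategy used in the paper is not stated in this section.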

Paper Structure (20 sections, 7 tables)