Benchmarking Procedural Language Understanding for Low-Resource Languages: A Case Study on Turkish

Arda Uzunoglu, Gözde Gül Şahin

TL;DR

This work tackles procedural language understanding (PLU) in low-resource languages by building a Turkish PLU benchmark derived from wikiHow through automated translation complemented by human validation. It introduces a 52k-tutorial Turkish corpus and six downstream tasks (linking actions, goal inference, step inference, step ordering, next event prediction, summarization) for evaluating language models, along with baselines that include a retrieval-and-reranking pipeline and both multilingual and Turkish-specific models. The results show that language-specific models generally outperform multilingual ones, but model size and data quality critically influence performance; even the best Turkish models lag behind their English counterparts, indicating room for improvement. The authors publicly release all data, task splits, and baselines to spur further research in PLU for non-English languages.

Abstract

Understanding procedural natural language (e.g., step-by-step instructions) is a crucial step toward execution and planning. However, while there are ample corpora and downstream tasks available in English, the field lacks such resources for most languages. To address this gap, we conduct a case study on Turkish procedural texts. We first expand the number of tutorials in Turkish wikiHow from 2,000 to 52,000 using automated translation tools, where the translation quality and loyalty to the original meaning are validated by a team of experts on a random set. Then, we generate several downstream tasks on the corpus, such as linking actions, goal inference, and summarization. To tackle these tasks, we implement strong baseline models via fine-tuning large language-specific models such as TR-BART and BERTurk, as well as multilingual models such as mBART, mT5, and XLM. We find that language-specific models consistently outperform their multilingual counterparts by a significant margin across most procedural language understanding (PLU) tasks. We release our corpus, downstream tasks, and the baseline models at https://github.com/GGLAB-KU/turkish-plu.

Paper Structure

This paper contains 48 sections, 3 figures, and 11 tables.

Figures (3)

  • Figure 1: An example step with a hyperlink redirecting it to a tutorial. (Step says "Connect your printer to your computer" and the redirected tutorial has the title of "How to Connect a Printer to a Computer")
  • Figure 2: An example step from the "How to Get Mold Out of Clothing" tutorial. The bolded part is the step headline, which is used as the summary, while the step description serves as the text to be summarized. Because the step description does not include the step headline, the task is formulated as abstractive summarization.
  • Figure 3: Performance of the BERTurk-based reranking models trained with different percentages of the translated data's train split: (a) R@1 and (b) R@10. 0% means the reranking model is trained only on the data originally written in Turkish.
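
The R@1 and R@10 metrics named in Figure 3 can be illustrated with a minimal sketch. It assumes each query (for example, a step) has exactly one gold target (for example, the tutorial it links to) and that a model returns a ranked candidate list per query; the function name `recall_at_k` and the toy data are hypothetical and are not taken from the authors' released code.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose single gold item appears in the top-k candidates.

    Assumption: one gold target per query, as in a step-to-tutorial linking setup.
    """
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same number of queries")
    if not gold:
        return 0.0
    hits = sum(1 for candidates, target in zip(ranked, gold) if target in candidates[:k])
    return hits / len(gold)


# Hypothetical toy rankings for two steps (illustrative only, not real results).
ranked = [
    [
        "How to Connect a Printer to a Computer",
        "How to Install a Printer",
        "How to Print a Document",
    ],
    [
        "How to Wash Clothes",
        "How to Remove Stains",
        "How to Get Mold Out of Clothing",
    ],
]
gold = [
    "How to Connect a Printer to a Computer",
    "How to Get Mold Out of Clothing",
]

r_at_1 = recall_at_k(ranked, gold, k=1)
r_at_3 = recall_at_k(ranked, gold, k=3)
print(f"R@1 = {r_at_1:.2f}, R@3 = {r_at_3:.2f}")
```

In this toy case only the first query has its gold tutorial ranked first, while both gold tutorials fall within the top three, so the score rises as k grows; this is why R@10 is always at least as high as R@1 for the same model.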