Revealing Trends in Datasets from the 2022 ACL and EMNLP Conferences

Jesse Atuhurra, Hidetaka Kamigaito

TL;DR

This paper analyzes datasets introduced at ACL 2022 and EMNLP 2022 to uncover trends in NLP data curation and its impact on PLM performance. It systematically extracts attributes such as task coverage, dataset size, baselines, multilinguality, multimodality, and author affiliations from 92 papers. The study highlights a rise in multimodal and multilingual benchmarks, reveals collaboration patterns between academia and industry, and documents diverse data sources and generation methods, including prompting LLMs. The findings aim to guide researchers in curating higher-quality datasets and to inform future benchmark design and policy decisions in NLP.

Abstract

Natural language processing (NLP) has grown significantly since the advent of the Transformer architecture. Transformers have given rise to pre-trained large language models (PLMs), and the performance of NLP systems has improved tremendously across several tasks. NLP systems are now on par with or, in some cases, better than humans at specific tasks. However, it remains the norm that better-quality datasets at pretraining time enable PLMs to achieve better performance, regardless of the task. The need for quality datasets has prompted NLP researchers to continue creating new datasets to satisfy particular needs. For example, in 2022 the two top NLP conferences, ACL and EMNLP, accepted ninety-two papers introducing new datasets. This work aims to uncover the trends and insights contained in these datasets. Moreover, we provide suggestions to researchers interested in curating datasets in the future.

Paper Structure (17 sections, 4 figures, 17 tables)

Figures (4)

  • Figure 1: An overview of major NLP tasks covered in datasets published in ACL and EMNLP in 2022.
  • Figure 2: Datasets, major NLP tasks, and author-affiliation for datasets published at ACL and EMNLP in 2022.
  • Figure 3: The number of non-English monolingual datasets. Patois* refers to Jamaican Patois.
  • Figure 4: The varying sizes of datasets. Most datasets have between 10K and 50K samples.