Evaluation Metrics for Text Data Augmentation in NLP

Marcellus Amadeus, William Alberto Cruz Castañeda

TL;DR

This paper addresses the fragmentation in how text data augmentation methods are evaluated in NLP, which stems from task-, dataset-, and model-specific metrics. It proposes a comprehensive taxonomy of evaluation metrics across ten categories and links each to concrete tools and implementation guidance. The work surveys literature from 2018 to 2023 to map existing metrics (e.g., BLEU, chrF, BERTScore, WER) and highlights the gaps that stand in the way of a unified benchmark. The proposed framework aims to improve comparability, reproducibility, and adoption of standardized evaluation practices in both research and industry.
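To make one of the metrics named above concrete, word error rate (WER) is commonly defined as the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The following is a minimal pure-Python sketch under stated assumptions: the function name `word_error_rate` and the whitespace tokenization are illustrative choices made here, not something the paper prescribes, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by splitting on whitespace; real
    pipelines usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical usage: compare an augmented sentence against its original.
original = "the cat sat on the mat"
augmented = "the cat sat on a mat"
print(word_error_rate(original, augmented))
```

In an augmentation setting, a score near 0 would suggest the augmented text stays close to the original at the surface level, while larger values indicate heavier rewriting. Whether that is desirable depends on the task, which is part of the comparability problem described above.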

Abstract

Recent surveys on data augmentation for natural language processing have reported a range of techniques and advances in the field. Several frameworks, tools, and repositories support the implementation of text data augmentation pipelines. However, evaluation criteria and standards for comparing methods are lacking: differences in tasks, metrics, datasets, architectures, and experimental settings make comparisons meaningless. The methods themselves also lack unification, and text data augmentation research would benefit from unified metrics for comparing different augmentation methods. Academia and industry are therefore seeking relevant evaluation metrics for text data augmentation techniques. The contribution of this work is a taxonomy of evaluation metrics for text augmentation methods, intended to serve as a direction toward a unified benchmark. The proposed taxonomy is organized into categories, each including tools for implementation and metric calculation. Finally, with this study, we intend to present opportunities to explore the unification and standardization of text data augmentation metrics.

Paper Structure (14 sections, 1 equation, 1 figure, 9 tables)