Autoencoder-Based Framework to Capture Vocabulary Quality in NLP

Vu Minh Hoang Dang, Rakesh M. Verma

TL;DR

This work addresses the failure of traditional lexical metrics to capture contextual, semantic, and structural aspects of vocabulary in NLP datasets. It introduces an autoencoder-based framework that uses neural capacity as a proxy for vocabulary richness, diversity, and complexity. The framework uses two setups (a basic, non-bottlenecked one and a squeezed one) and is evaluated on the DIFrauD and Project Gutenberg corpora. Key findings: richer vocabularies require wider hidden layers; results are robust to language and text length but sensitive to lexical depth and historical complexity; and 18th- and 20th-century texts differ notably. The approach is flexible and data-driven, but the authors note its computational overhead and the need to combine the proxy with other measures. Future directions include noisy and low-resource data and contextual embeddings.

Abstract

Linguistic richness is essential for advancing natural language processing (NLP), as dataset characteristics often directly influence model performance. However, traditional metrics such as Type-Token Ratio (TTR), Vocabulary Diversity (VOCD), and Measure of Textual Lexical Diversity (MTLD) do not adequately capture contextual relationships, semantic richness, and structural complexity. In this paper, we introduce an autoencoder-based framework that uses neural network capacity as a proxy for vocabulary richness, diversity, and complexity, enabling a dynamic assessment of the interplay between vocabulary size, sentence structure, and contextual depth. We validate our approach on two distinct datasets: the DIFrauD dataset, which spans multiple domains of deceptive and fraudulent text, and the Project Gutenberg dataset, representing diverse languages, genres, and historical periods. Experimental results highlight the robustness and adaptability of our method, offering practical guidance for dataset curation and NLP model design. By enhancing traditional vocabulary evaluation, our work fosters the development of more context-aware, linguistically adaptive NLP systems.
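
The idea of using network capacity as a proxy for vocabulary richness can be illustrated with a toy sketch. This is not the authors' architecture or training procedure: it assumes a linear autoencoder without bias, whose best reconstruction at hidden width k is given in closed form by the rank-k truncated SVD, and it measures reconstruction error rather than the accuracy reported in the paper. The names `one_hot` and `min_hidden_width`, the tolerance, and the toy token streams are all hypothetical.

```python
import numpy as np

def one_hot(tokens, vocab):
    """Encode each token as a one-hot row over a fixed vocabulary."""
    index = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(tokens), len(vocab)))
    for row, tok in enumerate(tokens):
        X[row, index[tok]] = 1.0
    return X

def min_hidden_width(X, tol=1e-9):
    """Smallest hidden width k at which the best linear autoencoder's
    relative squared reconstruction error is at most tol.

    Assumption: the best rank-k linear reconstruction is the truncated SVD,
    so the error at width k is the sum of the discarded squared singular values.
    """
    s = np.linalg.svd(X, compute_uv=False)
    energy = s ** 2
    total = energy.sum()
    for k in range(len(s) + 1):
        if energy[k:].sum() <= tol * total:
            return k
    return len(s)

# Hypothetical toy data: same length, same vocabulary table, different richness
vocab = [f"w{i}" for i in range(20)]
poor = [vocab[i % 3] for i in range(200)]    # only 3 distinct words used
rich = [vocab[i % 15] for i in range(200)]   # 15 distinct words used

poor_width = min_hidden_width(one_hot(poor, vocab))
rich_width = min_hidden_width(one_hot(rich, vocab))
print(poor_width, rich_width)
```

In this toy setting the required width simply equals the number of distinct words in use, so the richer stream needs the wider hidden layer. A trained nonlinear autoencoder on real text would not reduce to a rank computation like this.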

Paper Structure

This paper contains 21 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: This plot shows the change in the DQI 1 metric (Y-axis), which measures vocabulary quality, across different numbers of unique rows in a fixed sample size of 1000 rows (X-axis). The DQI 1 value decreases as the number of unique rows increases.
  • Figure 2: Scatter plot of Type-Token Ratios (TTR) vs. token counts for multilingual Gutenberg datasets, showing TTR's sensitivity to dataset size and its limitations for comparing vocabulary diversity in large corpora. Curve fits highlight consistent trends across languages.
  • Figure 3: Scatter plot showing the relationship between VOCD and Work Length (in tokens) across multiple languages. Each color represents a different language, with corresponding linear fit lines indicating trends. The negative slope suggests a decline in VOCD as work length increases.
  • Figure 4: Scatter plot depicting the relationship between MTLD and Work Length (in tokens) across various languages. Each color represents a different language, with linear trend lines showing a generally positive correlation, suggesting that MTLD tends to increase slightly as work length grows.
  • Figure 5: Framework for Evaluating Vocabulary Quality Using an Autoencoder Model. The process begins with preprocessing and setup definition, followed by autoencoder training and evaluation, resulting in model accuracy and insights into vocabulary richness, diversity, and complexity.
  • ...and 3 more figures
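
The size sensitivity of TTR noted in the Figure 2 caption can be reproduced with a minimal sketch. It assumes whitespace tokenization and the standard definition of TTR as distinct tokens divided by total tokens; the function name `type_token_ratio` and the toy sentence are illustrative and not taken from the paper.

```python
from typing import List

def type_token_ratio(tokens: List[str]) -> float:
    """Return distinct tokens / total tokens, or 0.0 for empty input."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Repeating a sentence adds tokens but no new types, so TTR falls
# even though the underlying vocabulary is unchanged.
sentence = "the cat sat on the mat".split()
short_text = sentence          # 6 tokens, 5 types
long_text = sentence * 10      # 60 tokens, still 5 types

short_ttr = type_token_ratio(short_text)
long_ttr = type_token_ratio(long_text)
print(short_ttr, long_ttr)
```

Because the ratio shrinks as a text grows while its vocabulary stays fixed, raw TTR values are hard to compare across corpora of different sizes.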