Deep learning models for representing out-of-vocabulary words

Johannes V. Lochter, Renato M. Silva, Tiago A. Almeida

TL;DR

This work evaluates how well deep learning models represent out-of-vocabulary words across intrinsic and extrinsic NLP tasks. By comparing morphology- and context-based approaches (e.g., FastText, Mimick, Comick, HiCE) with a range of DL architectures (LSTM, Transformer variants, GPT-2, RoBERTa, DistilBERT, Electra), the authors reveal task-dependent strengths: DistilBERT performs best intrinsically on the Chimera benchmark, while Comick frequently achieves top extrinsic performance in text categorization. The results highlight that no single method dominates across all tasks, with FastText often leading in NER/POS tagging and RoBERTa lagging in OOV handling. The findings underscore the need for task-specific OOV strategies and suggest combining morphology, context, and dictionary-based cues for robust downstream performance, especially in noisy or domain-specific data contexts.

Abstract

Communication has become increasingly dynamic with the popularization of social networks and applications that allow people to express themselves and communicate instantly. In this scenario, the quality of distributed representation models is impacted by new words that appear frequently or that derive from spelling errors. These words, which are unknown to the models and are called out-of-vocabulary (OOV) words, need to be handled properly so as not to degrade the quality of natural language processing (NLP) applications, which depend on an appropriate vector representation of the texts. To better understand this problem and find the best techniques for handling OOV words, in this study we present a comprehensive performance evaluation of deep learning models for representing OOV words. We performed an intrinsic evaluation using a benchmark dataset and an extrinsic evaluation using different NLP tasks: text categorization, named entity recognition, and part-of-speech tagging. Although the results indicated that the best technique for handling OOV words differs from task to task, Comick, a deep learning method that infers the embedding from the context and the morphological structure of the OOV word, obtained promising results.

Paper Structure

This paper contains 18 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Comick architecture [garneau:2018-pieoovdt].
  • Figure 2: HiCE architecture [hu:2019_hice].
  • Figure 3: Example of a Chimera defined by four sentences.