Learning Mutually Informed Representations for Characters and Subwords

Yilin Wang, Xinyi Hu, Matthew R. Gormley

TL;DR

The paper addresses the limitations of subword tokenization by introducing the entanglement model, which treats a character language model and a subword language model as two modalities connected through cross-attention. The two streams exchange information across multiple co-attention layers and produce mutually informed representations for both granularities. The model is evaluated on sequence labeling and text classification in English and multilingual settings, including noisy text and intraword code-switching. Results show consistent gains over the backbone models, most notably for low-resource languages and in scenarios where character information is especially informative; on several English tasks, the entanglement model matches or surpasses larger pretrained subword models. Extensions such as explicit positional embeddings and MLM pretraining did not consistently improve performance, which suggests that the core co-attention mechanism already aligns character- and subword-level information during fine-tuning. The method offers a flexible way to fuse character- and subword-level signals without extensive pretraining overhead, and it provides a foundation for future work on stronger character backbones and broader backbone combinations.

Abstract

Most pretrained language models rely on subword tokenization, which processes text as a sequence of subword tokens. However, different granularities of text, such as characters, subwords, and words, can contain different kinds of information. Previous studies have shown that incorporating multiple input granularities improves model generalization, yet very few of them output useful representations for each granularity. In this paper, we introduce the entanglement model, aiming to combine character and subword language models. Inspired by vision-language models, our model treats characters and subwords as separate modalities, and it generates mutually informed representations for both granularities as output. We evaluate our model on text classification, named entity recognition, POS-tagging, and character-level sequence labeling (intraword code-switching). Notably, the entanglement model outperforms its backbone language models, particularly in the presence of noisy texts and low-resource languages. Furthermore, the entanglement model even outperforms larger pretrained models on all English sequence labeling tasks and classification tasks. We make our code publicly available.
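
To make the two-stream idea concrete, the following is a minimal sketch of one co-attention exchange between a character stream and a subword stream. It is not the authors' implementation: the function names (`cross_attend`, `co_attention_layer`), the toy dimensions, and the input values are all hypothetical, and the sketch assumes single-head scaled dot-product attention with a residual connection and no learned projections, feed-forward sublayers, or pretrained backbones.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Each query vector attends over the other stream's vectors.

    Assumption: keys and values are the same vectors (no learned
    projections), and the attended context is added back to the query
    as a residual connection.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        context = [
            sum(w * v[j] for w, v in zip(weights, keys_values))
            for j in range(dim)
        ]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def co_attention_layer(char_states, subword_states):
    """One exchange: both streams are updated from the other's old states."""
    new_chars = cross_attend(char_states, subword_states)
    new_subwords = cross_attend(subword_states, char_states)
    return new_chars, new_subwords


# Toy inputs (hypothetical): 5 character vectors and 2 subword vectors, dim 4.
char_states = [[0.1 * (i + j) for j in range(4)] for i in range(5)]
subword_states = [[0.5 - 0.1 * j * (i + 1) for j in range(4)] for i in range(2)]

# Stack two exchanges, mirroring the "multiple co-attention layers" idea.
for _ in range(2):
    char_states, subword_states = co_attention_layer(char_states, subword_states)

print(len(char_states), len(subword_states))
```

Each stream keeps its own sequence length (one vector per character, one per subword), so the output can feed either a character-level or a subword-level task head. This is the sense in which both representations are "mutually informed".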

Paper Structure

This paper contains 42 sections, 12 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Architecture of the entanglement model.
  • Figure 2: Architecture of the CO-TRM block inside the co-attention module.
  • Figure 3: Character-word matching loss.