ZEN 2.0: Continue Training and Adaption for N-gram Enhanced Text Encoders

Yan Song, Tong Zhang, Yonggang Wang, Kai-Fu Lee

TL;DR

ZEN 2.0 tackles the limitations of pure character-based encoders by integrating large-scale n-gram information into pre-training. It introduces refined, weighted n-gram representations, whole n-gram masking, and relative positional encoding, scaling the architecture to BERT-large and extending to Arabic in addition to Chinese. Through PMI-driven n-gram lexicons and extensive pre-training data, ZEN 2.0 achieves state-of-the-art results across a broad suite of Chinese and Arabic tasks, with ablations confirming the contribution of each enhancement. The work demonstrates strong cross-language generalization and provides resources to the community to foster further research in n-gram–aware text representations.

Abstract

Pre-trained text encoders have drawn sustained attention in natural language processing (NLP) and have shown promising results across different tasks. Recent studies have illustrated that external self-supervised signals (or knowledge extracted by unsupervised learning, such as n-grams) provide useful semantic evidence for understanding languages such as Chinese, and thereby improve performance on various downstream tasks. To further enhance the encoders, in this paper we propose to pre-train n-gram-enhanced encoders with a large volume of data and advanced training techniques. Moreover, we extend the encoder to different languages as well as different domains, confirming that the same architecture is applicable to these varying circumstances, and we observe new state-of-the-art performance on a long list of NLP tasks across languages and domains.

Paper Structure

This paper contains 18 sections, 6 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: An illustration of the refined n-gram representations and their application to the character encoder, where n-grams and their representations associated with the character "幻" (highlighted in blue) are weighted.
  • Figure 2: An illustration of the differences between character masking (ZEN 1.0) and n-gram masking (ZEN 2.0) with a given input text. Masked characters are represented by [M]. For ZEN 2.0, adjacent character n-grams obtained from an off-the-shelf tokenizer (segmenter) are combined into a new n-gram (highlighted in blue) if that n-gram appears in the n-gram lexicon. In the given example, to predict the masked character "儿" (son) highlighted in green, ZEN 1.0 relies more on its preceding characters "一" (one) and "会" (meeting) highlighted in yellow (because "一会儿" (a while) is a frequent phrase in Chinese), while ZEN 2.0 is designed to learn information from large text granularity (e.g., "一会儿" highlighted in yellow in the clause "一会儿乌云密布") with all three characters masked together by whole n-gram masking.
  • Figure 3: An illustration of how the relative positional information (i.e., $\mathbf{R}^{*}$) is modeled in each head of the multi-head attention layer in the character encoder, where "MatMul" refers to matrix multiplication; $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are the query, key, and value matrices, respectively; and $\mathbf{u}$ and $\mathbf{v}$ are the trainable bias vectors.
  • Figure 4: The performance of different models on NLI (a) and MRC (b) with respect to the number of pre-training steps (in thousands), where the curves of BERT (L), ZEN 1.0 (L), and ZEN 2.0 (L) are shown in blue, orange, and green, respectively. The evaluation metric for NLI is accuracy and that for MRC is the F1 score.
  • Figure 5: Visualization of n-gram representations for some examples. The distance between two n-grams illustrates the similarity between their representations, where a low distance indicates the two n-grams have similar representations. N-grams in the same cluster are represented in the same color.
  • ...and 2 more figures
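
The whole n-gram masking described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `merge_segments` and `whole_ngram_mask`, the greedy longest-match merging, the 15% masking rate, and the toy lexicon are all hypothetical choices made here, and the sketch ignores details such as random or unchanged replacements for selected tokens.

```python
import random

MASK = "[M]"  # mask symbol as written in the Figure 2 caption


def merge_segments(segments, lexicon):
    """Combine adjacent segments into one unit when their concatenation
    appears in the n-gram lexicon (greedy, longest run first; an assumption)."""
    units = []
    i = 0
    while i < len(segments):
        merged_end = None
        # Try runs of at least two adjacent segments, longest first.
        for j in range(len(segments), i + 1, -1):
            if "".join(segments[i:j]) in lexicon:
                merged_end = j
                break
        if merged_end is not None:
            units.append("".join(segments[i:merged_end]))
            i = merged_end
        else:
            units.append(segments[i])
            i += 1
    return units


def whole_ngram_mask(units, mask_prob=0.15, rng=None):
    """Mask every character of a selected unit together.

    Returns (chars, labels): labels hold the original character at masked
    positions and None elsewhere.
    """
    rng = rng or random.Random()
    chars, labels = [], []
    for unit in units:
        masked = rng.random() < mask_prob  # one decision per unit, not per character
        for ch in unit:
            chars.append(MASK if masked else ch)
            labels.append(ch if masked else None)
    return chars, labels


# Hypothetical segmenter output for the clause in Figure 2.
segments = ["一", "会儿", "乌云", "密布"]
lexicon = {"一会儿"}  # toy lexicon; a real one would be built from a corpus

units = merge_segments(segments, lexicon)
chars, labels = whole_ngram_mask(units, mask_prob=0.15, rng=random.Random(0))
print(units)
print(chars)
```

Because the masking decision is made once per unit, the three characters of "一会儿" are either all masked or all visible, so the model cannot recover "儿" simply from its neighbours "一" and "会".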