Knowledge of Pretrained Language Models on Surface Information of Tokens

Tatsuya Hiraoka, Naoaki Okazaki

TL;DR

This work probes whether pretrained language models (PLMs) encode token surface information in their embeddings and whether they can generate such information. Using independent MLP probes on subword- and word-level embeddings across English and Japanese PLMs, plus a zero-shot generation evaluation, the authors examine token length, substrings, and token constitution. They find that embeddings encode partial knowledge of token length and substrings but largely fail to capture the exact ordering of characters within tokens, and that a bottleneck on the decoder side appears to limit the generation of surface information. The findings suggest a need for surface-aware PLMs and motivate future cross-linguistic benchmarks to better support tasks requiring fine-grained surface knowledge.

Abstract

Do pretrained language models have knowledge regarding the surface information of tokens? We examined the surface information stored in word or subword embeddings acquired by pretrained language models from the perspectives of token length, substrings, and token constitution. Additionally, we evaluated the ability of models to generate knowledge regarding token surfaces. We focused on 12 pretrained language models that were mainly trained on English and Japanese corpora. Experimental results demonstrate that pretrained language models have knowledge regarding token length and substrings but not token constitution. Additionally, the results imply that there is a bottleneck on the decoder side in terms of effectively utilizing acquired knowledge.

Paper Structure (19 sections, 3 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Input and output examples when asking GPT-3.5 Turbo about the surface information of words (as of January 1, 2024). The Japanese example has the same meaning as the English text, asking for the length of, and the third character in, 人類学者 (anthropologist).
  • Figure 2: Outline of the methods used to examine the obtained knowledge regarding surface information in word/subword embeddings ($\mathbf{v}$ in the figure). This figure shows the inputs and outputs of each method. We trained independent MLPs for each task.
  • Figure 3: Comparison of predicted and true lengths of word-level inputs for BERT-base-cased. The red line indicates correct predictions.
  • Figure 4: Prediction accuracy of the $N$th character in forward (top) and backward (bottom) experimental settings. $N=-2$ indicates the prediction accuracy for the second character counted in the backward direction from the word tail.
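
The indexing convention in the Figure 1 and Figure 4 captions can be illustrated with a minimal sketch of the ground-truth labels such probes would be scored against. The function names `token_length` and `nth_char` are hypothetical and do not come from the paper, and the sketch assumes that one character corresponds to one Unicode code point, which may differ from the authors' preprocessing.

```python
from typing import Optional


def token_length(token: str) -> int:
    """Number of characters in the token (assumption: one code point each)."""
    return len(token)


def nth_char(token: str, n: int) -> Optional[str]:
    """Return the N-th character of a token.

    Positive n counts forward from the head (n=1 is the first character).
    Negative n counts backward from the tail (n=-2 is the second character
    from the end, matching the Figure 4 caption).
    Returns None when the token is too short or n is 0.
    """
    if n == 0 or abs(n) > len(token):
        return None
    # Forward positions are 1-indexed; negative positions map directly
    # onto Python's negative indexing.
    return token[n - 1] if n > 0 else token[n]


# The word used in Figure 1: 人類学者 (anthropologist)
word = "人類学者"
print(token_length(word))   # length of the word
print(nth_char(word, 3))    # third character from the head
print(nth_char(word, -2))   # second character from the tail
```

For 人類学者, the third character from the head and the second character from the tail coincide because the word has four characters.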