Knowledge of Pretrained Language Models on Surface Information of Tokens
Tatsuya Hiraoka, Naoaki Okazaki
TL;DR
This work probes whether pretrained language models (PLMs) encode the surface information of tokens in their embeddings and whether they can generate that information. Using independent MLP probes on subword- and word-level embeddings from English and Japanese PLMs, plus a zero-shot generation evaluation, the authors examine three properties: token length, substrings, and token constitution (the exact ordering of characters within a token). They find that embeddings encode partial knowledge of token length and substrings but largely fail to capture token constitution, and the results suggest that a decoder-side bottleneck limits how well models can generate surface information. The findings point to a need for surface-aware PLMs and motivate future cross-linguistic benchmarks to better support tasks requiring fine-grained surface knowledge.
Abstract
Do pretrained language models have knowledge regarding the surface information of tokens? We examined the surface information stored in word or subword embeddings acquired by pretrained language models from the perspectives of token length, substrings, and token constitution. We also evaluated the ability of models to generate knowledge regarding token surfaces. We focused on 12 pretrained language models that were mainly trained on English and Japanese corpora. Experimental results demonstrate that pretrained language models have knowledge regarding token length and substrings but not token constitution. The results further imply that there is a bottleneck on the decoder side in terms of effectively utilizing acquired knowledge.
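To make the three perspectives concrete, the following is a minimal sketch of how surface-level targets for a single token might be derived before any probe is trained. It is an illustration under stated assumptions, not the authors' implementation: the function names (`strip_marker`, `surface_targets`), the list of subword markers, and the choice to represent token constitution as an ordered character list are all hypothetical.

```python
from typing import Dict, List, Union

# Assumption: common subword markers; the models studied may use others.
SUBWORD_MARKERS = ("##", "▁", "Ġ")


def strip_marker(token: str) -> str:
    """Remove a leading subword marker, if present (hypothetical helper)."""
    for marker in SUBWORD_MARKERS:
        if token.startswith(marker):
            return token[len(marker):]
    return token


def surface_targets(
    token: str, candidate_substrings: List[str]
) -> Dict[str, Union[int, Dict[str, bool], List[str]]]:
    """Build illustrative targets for the three surface properties."""
    surface = strip_marker(token)
    return {
        # Token length: number of characters in the surface form.
        "length": len(surface),
        # Substrings: whether each candidate occurs contiguously in the token.
        "substrings": {s: s in surface for s in candidate_substrings},
        # Token constitution: the characters in their exact order.
        "constitution": list(surface),
    }


targets = surface_targets("##ing", ["in", "ng", "gi"])
print(targets)
```

In this sketch the candidate "gi" is marked absent even though both of its characters occur in the token, which shows why substring and constitution targets depend on character order and not only on which characters are present.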
