Tokenization Preference for Human and Machine Learning Model: An Annotation Study

Tatsuya Hiraoka, Tomoya Iwakura

TL;DR

The tokenizations preferred by humans and by ML models are not always the same. The results also suggest that existing tokenization methods based on language models could be a good compromise for both humans and ML models.

Abstract

Is the tokenization preferred by humans also preferred by machine-learning (ML) models? This study examines the relationship between the tokenization preferred by humans (appropriateness and readability) and the one preferred by ML models (performance on an NLP task). The question texts of a Japanese commonsense question-answering dataset are tokenized with six different tokenizers, and the performance of human annotators and ML models is compared. Furthermore, we analyze the relationships among the answer performance of humans and ML models, the appropriateness of the tokenization for humans, and the human response time to questions. This study provides quantitative evidence that the tokenizations preferred by humans and by ML models are not always the same. The results also suggest that existing tokenization methods based on language models could be a good compromise for both humans and ML models.

Paper Structure (15 sections, 3 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: We investigate whether the tokenization preferred by ML models is always the same as the one preferred by humans, and vice versa.
  • Figure 2: Annotation tool for ranking readability. The Japanese text means "What is that smoldering thing that appears when you build a fire?". The UI instruction text is translated into English for explanation.
  • Figure 3: Annotation tool for the bag-of-words question (left), answer candidates (center), and other UI buttons (right). The original order of the Japanese text is "晴れて_いる時_に_突然_降る弱い_雨の_ことをな_んとい_う?" (random tokenization) meaning "What do you call a weak rain that suddenly falls during a sunny day?". The correct answer is "狐の嫁入り" meaning "Fox's Wedding". The texts on the UI buttons are translated into English for the explanation.
  • Figure 4: Outline of the QA system, which uses bag-of-words (BoW) or a BiLSTM to calculate the score for the question "草食動物はどれ?" ("Which animals are herbivores?") and the choice "ライオン" ("lion").
  • Figure 5: Box plot of the relation between tokenization length and response time over all annotation results. The measurements pool the annotations for all tokenizers and exclude samples whose response time is longer than 60 s.
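
The filtering described in the Figure 5 caption can be sketched as follows. This is a minimal sketch and not the authors' analysis code: the function name, the record layout, and the toy values are hypothetical, and treating "tokenization length" as the number of tokens in a question is an assumption.

```python
from collections import defaultdict
from statistics import median

# Cutoff stated in the Figure 5 caption: slower samples are excluded.
MAX_RESPONSE_TIME_S = 60.0


def group_response_times(records, max_time=MAX_RESPONSE_TIME_S):
    """Group response times by tokenization length, dropping slow samples.

    `records` is an iterable of (num_tokens, response_time_seconds) pairs
    pooled over all tokenizers (hypothetical layout, assumed here).
    """
    groups = defaultdict(list)
    for num_tokens, response_time in records:
        if response_time > max_time:
            continue  # exclude samples longer than the cutoff
        groups[num_tokens].append(response_time)
    return dict(groups)


# Toy values for illustration only; they are not results from the paper.
toy_records = [(5, 4.2), (5, 6.0), (8, 9.5), (8, 75.0), (12, 12.1)]
grouped = group_response_times(toy_records)

# One summary value per tokenization length (a box plot would show the
# full distribution of each list instead).
medians = {length: median(times) for length, times in sorted(grouped.items())}
```

Each list in `grouped` could then be passed to a box-plot routine, one box per tokenization length.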