Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods

Burak Suyunu, Enes Taylan, Arzucan Özgür

TL;DR

This study evaluates three prominent tokenization approaches, Byte-Pair Encoding, WordPiece, and SentencePiece, across varying vocabulary sizes, analyzing their effectiveness in protein sequence representation, domain boundary preservation, and adherence to established linguistic laws.

Abstract

Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constraints. This study evaluates three prominent tokenization approaches, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, across varying vocabulary sizes (400-6400), analyzing their effectiveness in protein sequence representation, domain boundary preservation, and adherence to established linguistic laws. Our comprehensive analysis reveals distinct behavioral patterns among these tokenizers, with vocabulary size significantly influencing their performance. BPE demonstrates better contextual specialization and marginally better domain boundary preservation at smaller vocabularies, while SentencePiece achieves better encoding efficiency, leading to lower fertility scores. WordPiece offers a balanced compromise between these characteristics. However, all tokenizers show limitations in maintaining protein domain integrity, particularly as vocabulary size increases. Analysis of linguistic law adherence shows partial compliance with Zipf's and Brevity laws but notable deviations from Menzerath's law, suggesting that protein sequences may follow distinct organizational principles from natural languages. These findings highlight the limitations of applying traditional NLP tokenization methods to protein sequences and emphasize the need for developing specialized tokenization strategies that better account for the unique characteristics of proteins.
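As a rough illustration of how adherence to Zipf's law can be checked, the sketch below fits a straight line to log-frequency versus log-rank for a list of tokens and returns the slope; a slope close to -1 indicates Zipf-like behavior. The function name `zipf_slope` and the plain least-squares fit are assumptions made for this example and are not necessarily the procedure used in the paper.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Hypothetical helper for illustration only; the paper's exact
    fitting procedure is not specified in this summary.
    """
    # Token frequencies sorted from most to least frequent (rank 1, 2, ...).
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens to fit a slope")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]

    # Ordinary least squares for a single predictor.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy corpus whose frequencies are exactly proportional to 1 / rank.
toy_tokens = ["A"] * 120 + ["B"] * 60 + ["C"] * 40 + ["D"] * 30 + ["E"] * 24
slope = zipf_slope(toy_tokens)
```

In practice the input would be the token stream produced by one tokenizer at one vocabulary size, so that slopes can be compared across tokenizers and sizes.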

Paper Structure

This paper contains 13 sections, 1 equation, and 9 figures.

Figures (9)

  • Figure 1: The plot of percentage of shared tokens between different pairs of tokenizers across different vocabulary sizes.
  • Figure 2: The plots of average lengths of tokens in (a) vocabulary and (b) test data and (c) fertility scores of BPE, WordPiece, and SentencePiece across different vocabulary sizes.
  • Figure 3: Number of distinct neighbors each token encounters in a width-5 window, shown for the top 350 tokens. Plots are for BPE, WordPiece, and SentencePiece across different vocabulary sizes (VS).
  • Figure 4: A domain is considered a hit if its start and end align with the beginning and end of a token, respectively. The plots show the hit percentages for BPE, WordPiece, and SentencePiece across different vocabulary sizes.
  • Figure 5: The slope values for Zipf's law plots of BPE (Protein and English), WordPiece (Protein), and SentencePiece (Protein) across different vocabulary sizes. -1 is the ideal slope value.
  • ...and 4 more figures
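
The hit criterion in the Figure 4 caption (a domain counts as a hit when its start coincides with the beginning of a token and its end coincides with the end of a token) can be sketched as follows. The function names, the 0-based half-open `(start, end)` domain coordinates, and the toy sequence are assumptions chosen for this illustration; the paper's own coordinate convention is not given here.

```python
def token_boundaries(tokens):
    """Return the sets of token start offsets and (exclusive) end offsets."""
    starts, ends = set(), set()
    position = 0
    for token in tokens:
        starts.add(position)
        position += len(token)
        ends.add(position)
    return starts, ends


def domain_hit_percentage(tokens, domains):
    """Percentage of domains whose start and end both fall on token boundaries.

    `domains` is a list of (start, end) pairs, assumed 0-based and half-open.
    """
    if not domains:
        return 0.0
    starts, ends = token_boundaries(tokens)
    hits = sum(1 for start, end in domains if start in starts and end in ends)
    return 100.0 * hits / len(domains)


# Toy example: the sequence "MKTAYIAKQR" split into three tokens.
toy_tokens = ["MKT", "AYIA", "KQR"]
toy_domains = [(0, 7), (3, 10), (2, 7)]  # the last one starts mid-token
hit_rate = domain_hit_percentage(toy_tokens, toy_domains)
```

Under this definition, the first two toy domains are hits and the third is a miss, because it begins inside the token `MKT`.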