How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis

Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder

TL;DR

This work investigates how tokenization strategies influence the effectiveness of transformer-based models on binary code analysis, focusing on assembly code. By systematically evaluating WordPiece, Byte-Pair Encoding (BPE), and Unigram tokenizers across multiple vocabulary sizes and two preprocessing regimes, the study analyzes intrinsic properties (fertility, vocabulary overlap, and out-of-vocabulary rate) and extrinsic performance on masked-token and function-signature tasks. Key findings show that tokenizer choice and preprocessing materially affect downstream results: preprocessing generally improves performance, and different models exhibit distinct sensitivities to vocabulary size and tokenization style. The results offer actionable guidance for optimizing tokenization in low-level code analysis and advocate for task- and model-aware preprocessing pipelines. The work also outlines limitations and future directions, including larger-scale models, alternative tokenization implementations, and broader real-world datasets, to enhance robustness and generalizability in binary analysis workflows.

Abstract

Tokenization is fundamental in assembly code analysis, affecting both intrinsic characteristics, such as vocabulary size and semantic coverage, and extrinsic performance on downstream tasks. Despite its significance, tokenization in the context of assembly code remains underexplored. This study addresses that gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks such as function signature prediction, a critical problem in binary code analysis. To this end, we conduct a thorough study of various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing semantic nuances. Through intrinsic evaluations, we compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code. Using state-of-the-art pre-trained models such as the decoder-only Large Language Model (LLM) Llama 3.2, the encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate the effectiveness of these tokenizers across multiple performance metrics. Preliminary findings indicate that tokenizer choice significantly influences downstream performance, and that intrinsic metrics only partially predict extrinsic evaluation outcomes. These results reveal complex trade-offs between intrinsic tokenizer properties and their utility in practical assembly code tasks. Ultimately, this study provides valuable insights into optimizing tokenization models for low-level code analysis, contributing to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows.
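The intrinsic metrics named above (fertility, vocabulary overlap, and out-of-vocabulary rate) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: fertility is taken in its common sense of average subword tokens per word, overlap is assumed to be Jaccard similarity between two vocabularies, and `toy_tokenize` is a hypothetical stand-in for a trained BPE, Unigram, or WordPiece tokenizer.

```python
from typing import Callable, Iterable, List, Set


def fertility(tokenize: Callable[[str], List[str]], words: Iterable[str]) -> float:
    """Average number of subword tokens produced per input word."""
    words = list(words)
    if not words:
        return 0.0
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)


def vocab_overlap(vocab_a: Set[str], vocab_b: Set[str]) -> float:
    """Jaccard similarity of two vocabularies.

    Assumption: the paper may define overlap differently.
    """
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


def oov_rate(tokens: Iterable[str], unk_token: str = "[UNK]") -> float:
    """Fraction of emitted tokens that are the unknown-token placeholder."""
    tokens = list(tokens)
    return tokens.count(unk_token) / len(tokens) if tokens else 0.0


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical tokenizer: split a word into chunks of at most 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


# Whitespace-separated pieces of one assembly instruction.
words = "mov eax , dword ptr [ rbp - 8 ]".split()
print(fertility(toy_tokenize, words))  # 11 tokens over 10 words -> 1.1
```

A fertility close to 1 means most words survive as single tokens; higher values indicate more aggressive splitting and therefore longer input sequences for the model.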

Paper Structure

This paper contains 50 sections, 1 equation, 4 figures, 8 tables.

Figures (4)

  • Figure 1: An example of address-to-sequential-identifiers preprocessing. The code on the left represents the original code before preprocessing, while the code on the right shows the result after preprocessing.
  • Figure 2: Frequency distribution of disassembled functions based on the number of instructions per function.
  • Figure 3: Fertility evaluation comparison between BPE, Unigram, and WordPiece tokenizers on (a) the default disassembly dataset and (b) the preprocessed disassembly dataset.
  • Figure 4: Vocabulary overlap heatmaps for tokenizers across default and preprocessed datasets at four vocabulary sizes: 3K, 25K, 35K, and 128K.
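The address-to-sequential-identifiers preprocessing described in the Figure 1 caption can be illustrated roughly as follows. This is a sketch under stated assumptions, not the paper's actual pipeline: the `ADDR_` prefix, the rule that only hex literals of four or more digits count as addresses, and the function name `normalize_addresses` are all hypothetical choices made for the example.

```python
import re
from typing import Dict, List

# Assumption: hex literals with at least 4 digits are treated as addresses,
# so small immediates such as 0x8 are left untouched.
ADDRESS_RE = re.compile(r"\b0x[0-9a-fA-F]{4,}\b")


def normalize_addresses(lines: List[str], prefix: str = "ADDR_") -> List[str]:
    """Replace each distinct address with an identifier numbered by first appearance."""
    ids: Dict[str, str] = {}

    def replace(match):
        addr = match.group(0).lower()
        if addr not in ids:
            ids[addr] = f"{prefix}{len(ids)}"
        return ids[addr]

    return [ADDRESS_RE.sub(replace, line) for line in lines]


original = [
    "jmp 0x401020",
    "call 0x401000",
    "jne 0x401020",
    "mov eax, 0x8",
]
for before, after in zip(original, normalize_addresses(original)):
    print(f"{before:<16} -> {after}")
```

Mapping raw addresses to a small set of repeated identifiers keeps the tokenizer from spending vocabulary on values that are unique to one binary, while preserving which instructions refer to the same target.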