Enhancing Function Name Prediction using Votes-Based Name Tokenization and Multi-Task Learning

Xiaoling Zhang, Zhengzi Xu, Shouguo Yang, Zhi Li, Zhiqiang Shi, Limin Sun

TL;DR

Epitome tackles the challenge of predicting descriptive function names for binaries compiled under diverse optimization levels by integrating a pre-trained assembly-language model with a graph-based function semantics encoder in a multi-task framework. A key innovation is the votes-based function name tokenization that produces meaningful labels and mitigates label sparsity and out-of-vocabulary issues, complemented by a function semantics similarity task that reinforces cross-optimization semantic alignment. Empirical results show Epitome outperforms state-of-the-art tools (e.g., SymLM, NFRE) across multiple architectures and optimization levels, and demonstrates strong generalization to domain-shared and domain-deviated unseen binaries. The work also provides detailed dataset preprocessing and ablation analyses, confirming that both the semantic similarity task and robust tokenization contribute substantially to performance and practical applicability, including firmware analysis and IoT contexts.

Abstract

Reverse engineers gain valuable insights from descriptive function names, which are absent from publicly released binaries. Recent advances in binary function name prediction using data-driven machine learning show promise. However, existing approaches have difficulty capturing function semantics across diversely optimized binaries and fail to preserve the meaning of labels in function names. We propose Epitome, a framework that enhances function name prediction using votes-based name tokenization and multi-task learning, tailored for binaries compiled at different optimization levels. Epitome learns comprehensive function semantics using a pre-trained assembly language model and a graph neural network, and incorporates a function semantics similarity prediction task to maximize the similarity of function semantics across compilation optimization levels. In addition, we present two data preprocessing methods to improve the comprehensibility of function names. We evaluate the performance of Epitome using 2,597,346 functions extracted from binaries compiled with 5 optimization levels (O0-Os) for 4 architectures (x64, x86, ARM, and MIPS). Epitome outperforms the state-of-the-art function name prediction tool by up to 44.34%, 64.16%, and 54.44% in precision, recall, and F1 score, respectively, while also exhibiting superior generalizability.
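
The abstract reports precision, recall, and F1 score but does not say here how they are computed. As an illustration only, the sketch below assumes a set-based, token-level comparison between a predicted name and the ground-truth name, which is a common choice for function name prediction. The helpers `name_tokens` and `token_prf` are hypothetical, and the underscore split is a simplification, not the votes-based tokenization the paper proposes.

```python
def name_tokens(name):
    """Split a snake_case function name into lowercase tokens.

    Simplified stand-in for illustration; NOT the paper's votes-based tokenization.
    """
    return [tok for tok in name.lower().split("_") if tok]


def token_prf(predicted, truth):
    """Return (precision, recall, F1) over the token sets of two names.

    Assumption: metrics are computed per function on sets of name tokens.
    """
    pred = set(name_tokens(predicted))
    gold = set(name_tokens(truth))
    true_pos = len(pred & gold)  # tokens that appear in both names
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical prediction that misses one token of the true name.
    p, r, f1 = token_prf("ocb_encrypt", "ocb_encrypt_block")
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Under this assumption, a prediction that recovers two of three ground-truth tokens and adds no spurious ones scores full precision but reduced recall, which is why the three metrics can differ as much as the reported gains do.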

Paper Structure (29 sections, 11 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Code Snippet of Function ocb_encrypt at Different Compilation Optimization Levels.
  • Figure 2: The Overall Workflow of Epitome.
  • Figure 3: CNN for Generating a Node Vector, Using 4-Dimensional Embeddings and Kernels of Widths 2, 3, and 4. Feature Maps of Different Kernel Widths Are Marked in Different Colors.
  • Figure 4: Comparison With SymLM on SymLM's Dataset.
  • Figure 5: Comparison With NFRE on Our x64 Dataset.
  • ...and 4 more figures