VexIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity

S. VenkataKeerthy, Soumya Banerjee, Sayan Dey, Yashas Andaluri, Raghul PS, Subrahmanyam Kalyanasundaram, Fernando Magno Quintão Pereira, Ramakrishna Upadrasta

TL;DR

VexIR2Vec addresses binary similarity across varying compilation settings by using the architecture-neutral VEX-IR and a peephole-based representation learned through a three-phase pipeline: Phase I derives normalized peepholes from CFGs; Phase II pre-trains a vocabulary in an unsupervised manner, using knowledge graph embeddings to map IR entities to vectors; Phase III fine-tunes function embeddings with VexNet, a Siamese network with global attention that combines opcode, type, argument, string, and external-library-call (O/T/A/S/L) features. In diffing, the method outperforms the nearest baselines by $40\%$, $18\%$, $21\%$, and $60\%$ in cross-optimization, cross-compilation, cross-architecture, and obfuscation settings, respectively; in searching, it reaches a mean average precision of $0.76$, while remaining efficient and built only on open-source tools. Core innovations include the VEX-IR Normalization Engine (VexINE), random-walk peephole generation to capture structural and semantic signals, and an entity-level vocabulary that mitigates OOV issues. Together, these components yield robust, architecture-agnostic binary embeddings suitable for large-scale diffing and retrieval, with practical implications for vulnerability analysis, malware detection, and software provenance. The work demonstrates that a lightweight embedding framework can rival or surpass heavier ML approaches while reducing compute requirements, enabling broader adoption in real-world security workflows.

Abstract

Binary similarity involves determining whether two binary programs exhibit similar functionality, often originating from the same source code. In this work, we propose VexIR2Vec, an approach for binary similarity using VEX-IR, an architecture-neutral Intermediate Representation (IR). We extract the embeddings from sequences of basic blocks, termed peepholes, derived by random walks on the control-flow graph. The peepholes are normalized using transformations inspired by compiler optimizations. With these transformations, the VEX-IR Normalization Engine mitigates the architectural and compiler-induced variations in binaries while exposing semantic similarities. We then learn the vocabulary of representations at the entity level of the IR using knowledge graph embedding techniques in an unsupervised manner. This vocabulary is used to derive function embeddings for similarity assessment using VexNet, a feed-forward Siamese network designed to position similar functions closely and separate dissimilar ones in an n-dimensional space. This approach is amenable to both diffing and searching tasks, ensuring robustness against Out-Of-Vocabulary (OOV) issues. We evaluate VexIR2Vec on a dataset comprising 2.7M functions and 15.5K binaries from 7 projects compiled across 12 compilers targeting x86 and ARM architectures. In diffing experiments, VexIR2Vec outperforms the nearest baselines by $40\%$, $18\%$, $21\%$, and $60\%$ in cross-optimization, cross-compilation, cross-architecture, and obfuscation settings, respectively. In the searching experiment, VexIR2Vec achieves a mean average precision of $0.76$, outperforming the nearest baseline by $46\%$. Our framework is highly scalable and is built as a lightweight, multi-threaded, parallel library using only open-source tools. VexIR2Vec is $3.1$-$3.5\times$ faster than the closest baselines and orders of magnitude faster than other tools.
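
The searching result above is reported as mean average precision (MAP). As a reminder of what that metric measures, here is a minimal sketch of the standard MAP definition applied to ranked retrieval lists. The function names and the toy query data are illustrative assumptions; the paper's exact evaluation protocol (candidate pool size, how ties are handled) is not described in this summary.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Standard average precision for one query.

    ranked:   candidate functions ordered from most to least similar.
    relevant: ground-truth matches for the query function.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(relevant)


def mean_average_precision(
    queries: List[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Mean of the per-query average precision values."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Hypothetical toy data: two query functions and their ranked candidates.
queries = [
    (["f1", "g", "f2"], {"f1", "f2"}),  # AP = (1/1 + 2/3) / 2
    (["x", "f"], {"f"}),                # AP = (1/2) / 1
]
map_score = mean_average_precision(queries)
```

A MAP of 1.0 would mean every true match is ranked ahead of every non-match for every query, so the reported $0.76$ reflects both whether matches are found and how high they rank.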

Paper Structure (26 sections, 1 theorem, 5 figures, 1 algorithm)

This paper contains 26 sections, 1 theorem, 5 figures, 1 algorithm.

Key Result

Theorem 1

If Algorithm 1 (the random-walk procedure) is invoked on a VEX-IR function whose control-flow graph has $|V|$ basic blocks, with parameters $k$ and $c$, then it terminates in at most $c|V|$ steps.
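
The paper's random-walk algorithm is not reproduced in this summary, so the following is only a sketch of one procedure that satisfies the same kind of bound. It assumes that $k$ is the maximum number of basic blocks in a peephole and that $c$ is the number of visits each block must receive before generation stops; both readings, and the name `generate_peepholes`, are assumptions rather than the authors' definitions. Every walk starts at a block with fewer than $c$ visits and raises that block's count, so the loop runs at most $c|V|$ times.

```python
import random
from typing import Dict, Hashable, List


def generate_peepholes(
    cfg: Dict[Hashable, List[Hashable]], k: int, c: int, seed: int = 0
) -> List[List[Hashable]]:
    """Sketch of random-walk peephole generation (assumed semantics).

    cfg:  maps each basic block to its successor blocks; every successor
          must itself be a key of cfg.
    k:    assumed maximum number of blocks per peephole.
    c:    assumed number of visits required for every block.
    """
    if k < 1 or c < 0:
        raise ValueError("k must be >= 1 and c must be >= 0")
    rng = random.Random(seed)
    visits = {block: 0 for block in cfg}
    peepholes: List[List[Hashable]] = []
    while True:
        # Blocks that still need visits; start each walk from one of them.
        pending = [b for b in cfg if visits[b] < c]
        if not pending:
            break
        block = rng.choice(pending)
        walk: List[Hashable] = []
        for _ in range(k):
            walk.append(block)
            visits[block] += 1
            successors = cfg[block]
            if not successors:  # exit block: the walk cannot continue
                break
            block = rng.choice(successors)
        peepholes.append(walk)
    return peepholes


# Hypothetical diamond-shaped CFG with |V| = 4 blocks.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
peepholes = generate_peepholes(cfg, k=3, c=2)
```

In this sketch the number of walks, and hence the number of peepholes, never exceeds $c|V|$ (here $2 \cdot 4 = 8$), which mirrors the termination bound stated in the theorem.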

Figures (5)

  • Figure 1: VexIR2Vec uses angr for disassembling the binaries and obtaining the VEX-IR; Function Embeddings for binary similarity tasks are obtained by training simple Feed Forward Neural Networks.
  • Figure 2: VEX-IR generated by angr from x86 and ARM binaries for the program fragment "n3 = n1 + n2; n1 = n2; n2 = n3;", which computes the Fibonacci series.
  • Figure 3: The CFG of the adjust_relative_path method from elfedit of the binutils project, generated by the GCC 10 and Clang 10 compilers under different optimizations.
  • Figure 4: Heatmap showing the frequency distribution of the instructions across the binaries from FindUtils generated by Clang 10 and GCC 10 compilers with different optimization levels.
  • Figure 5: Overview of VexIR2Vec: The Control-Flow Graphs of the functions are generated from the VEX-IR of binaries built under different compilation configurations. Then, the peepholes are obtained and normalized by VexINE, the VEX-IR Normalization Engine, using different normalizing transformations. The function embedding is obtained from the embeddings derived from the Opcodes, Types, and Arguments, along with the strings and external library calls used in the function. This embedding is used as input to train VexNet, a simple feed-forward network in the Siamese setting designed to obtain the final embedding of the function.

Theorems & Definitions (2)

  • Definition 1: Peephole
  • Theorem 1