VexIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity
S. VenkataKeerthy, Soumya Banerjee, Sayan Dey, Yashas Andaluri, Raghul PS, Subrahmanyam Kalyanasundaram, Fernando Magno Quintão Pereira, Ramakrishna Upadrasta
TL;DR
VexIR2Vec addresses binary similarity under adversarial compilation by using the architecture-neutral VEX-IR and a peephole-based representation learned through a three-phase pipeline. Phase I derives normalized peepholes from control-flow graphs (CFGs). Phase II pre-trains a vocabulary in an unsupervised manner, using knowledge graph embeddings to map IR entities to vectors. Phase III fine-tunes function embeddings with VexNet, a Siamese network with global attention that combines O/T/A/S/L features. The method achieves high precision and scalability across diffing and searching tasks: in diffing, it outperforms the nearest baselines by $40\%$, $18\%$, $21\%$, and $60\%$ in cross-optimization, cross-compiler, cross-architecture, and obfuscation settings, respectively, while remaining efficient and open-source. Core innovations include the VEX-IR Normalization Engine (VexINE), random-walk peephole generation to capture structural and semantic signals, and an entity-level vocabulary that mitigates Out-Of-Vocabulary (OOV) issues. Together, these components yield robust, architecture-agnostic binary embeddings suitable for large-scale diffing and retrieval, with practical implications for vulnerability analysis, malware detection, and software provenance. The work demonstrates that a lightweight, interpretable embedding framework can rival or surpass heavier ML approaches while reducing compute and data requirements, enabling broader adoption in real-world security workflows.
Abstract
Binary similarity involves determining whether two binary programs exhibit similar functionality, often originating from the same source code. In this work, we propose VexIR2Vec, an approach for binary similarity using VEX-IR, an architecture-neutral Intermediate Representation (IR). We extract embeddings from sequences of basic blocks, termed peepholes, derived by random walks on the control-flow graph. The peepholes are normalized using transformations inspired by compiler optimizations. With these transformations, the VEX-IR Normalization Engine mitigates architectural and compiler-induced variations in binaries while exposing semantic similarities. We then learn a vocabulary of representations at the entity level of the IR using knowledge graph embedding techniques in an unsupervised manner. This vocabulary is used to derive function embeddings for similarity assessment using VexNet, a feed-forward Siamese network designed to position similar functions closely and separate dissimilar ones in an n-dimensional space. This approach is amenable to both diffing and searching tasks, and is robust against Out-Of-Vocabulary (OOV) issues. We evaluate VexIR2Vec on a dataset comprising 2.7M functions and 15.5K binaries from 7 projects compiled across 12 compilers targeting x86 and ARM architectures. In diffing experiments, VexIR2Vec outperforms the nearest baselines by $40\%$, $18\%$, $21\%$, and $60\%$ in cross-optimization, cross-compilation, cross-architecture, and obfuscation settings, respectively. In the searching experiment, VexIR2Vec achieves a mean average precision of $0.76$, outperforming the nearest baseline by $46\%$. Our framework is highly scalable and is built as a lightweight, multi-threaded, parallel library using only open-source tools. VexIR2Vec is $3.1$-$3.5\times$ faster than the closest baselines and orders of magnitude faster than other tools.
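To make the notion of a peephole concrete, the following is a minimal sketch of how sequences of basic blocks could be obtained by random walks on a control-flow graph. It is an illustration under stated assumptions, not the VexIR2Vec implementation: the function name `generate_peepholes`, the parameters `max_len` and `walks_per_block`, and the dictionary-based CFG encoding are all hypothetical, and the abstract does not specify the actual walk length or visiting policy.

```python
import random
from typing import Dict, List, Optional


def generate_peepholes(
    cfg: Dict[str, List[str]],
    max_len: int = 4,
    walks_per_block: int = 1,
    seed: Optional[int] = None,
) -> List[List[str]]:
    """Sample bounded random walks over a CFG (hypothetical helper).

    `cfg` maps each basic-block label to the labels of its successors.
    Each returned walk is a list of block labels, i.e. one candidate peephole.
    """
    rng = random.Random(seed)
    peepholes: List[List[str]] = []
    for start in cfg:
        for _ in range(walks_per_block):
            walk = [start]
            current = start
            # Follow randomly chosen successor edges until the walk reaches
            # max_len blocks or a block with no successors.
            while len(walk) < max_len:
                successors = cfg.get(current, [])
                if not successors:
                    break
                current = rng.choice(successors)
                walk.append(current)
            peepholes.append(walk)
    return peepholes


# Toy CFG for an if/else diamond; block labels are illustrative only.
cfg = {
    "entry": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}
peepholes = generate_peepholes(cfg, max_len=3, seed=0)
for walk in peepholes:
    print(" -> ".join(walk))
```

In this sketch, every walk follows real CFG edges, so each peephole is a feasible execution fragment; normalization and embedding would then operate on the instructions of the blocks in each walk. Bounding the walk length keeps the procedure terminating on CFGs with loops.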
