BLens: Contrastive Captioning of Binary Functions using Ensemble Embedding
Tristan Benoit, Yunru Wang, Moritz Dannehl, Johannes Kinder
TL;DR
BLens reframes function-name prediction for stripped binaries as a multimodal captioning task, aligning binary function patches with lexical function names through a contrastive learning objective. It introduces Combo, which fuses multiple binary embeddings into coherent function patches, and Lord, an MLM-based decoder with flexible autoregression that prioritizes precision. The approach achieves state-of-the-art results in both cross-binary and cross-project settings, with substantial gains in $F_1$, RougeL, and Bleu and strong generalization under distribution shift. The work shows practical potential for more reliable reverse-engineering tooling and provides open-source artifacts to support reproducibility and further research.
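To make the contrastive alignment idea concrete, the following is a minimal sketch of a symmetric, CLIP-style InfoNCE loss over paired function and name embeddings. It is an illustration only: this section does not specify the exact objective BLens uses, and the function names, the toy embeddings, and the temperature value of 0.07 are all assumptions.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(func_embs, name_embs, temperature=0.07):
    """Hypothetical CLIP-style loss: pair i of func_embs matches pair i of name_embs.

    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    f = [l2_normalize(v) for v in func_embs]
    n = [l2_normalize(v) for v in name_embs]
    # Cosine similarity between every function and every name, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(fi, nj)) / temperature for nj in n] for fi in f]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the function-to-name and name-to-function directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy example: matched pairs should score a lower loss than mismatched pairs.
funcs = [[1.0, 0.0], [0.0, 1.0]]
aligned_names = [[1.0, 0.0], [0.0, 1.0]]
shuffled_names = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = symmetric_contrastive_loss(funcs, aligned_names)
shuffled_loss = symmetric_contrastive_loss(funcs, shuffled_names)
```

Minimizing such a loss pulls each function embedding toward its own name embedding and pushes it away from the other names in the batch, which is the general sense in which a contrastive objective aligns two modalities.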
Abstract
Function names can greatly aid human reverse engineers, which has spurred the development of machine learning-based approaches to predicting function names in stripped binaries. Much current work in this area uses transformers, applying a metaphor of machine translation from code to function names. Still, function naming models face challenges in generalizing to projects unrelated to the training set. In this paper, we take a new approach by transferring advances in automated image captioning to the domain of binary reverse engineering, such that different parts of a binary function can be associated with parts of its name. We propose BLens, which combines multiple binary function embeddings into a new ensemble representation, aligns it with the latent space of the name representation via a contrastive learning approach, and generates function names with a transformer architecture tailored to this task. Our experiments demonstrate that BLens significantly outperforms the state of the art. In the usual setting of splitting per binary, we achieve an $F_1$ score of 0.79 compared to 0.70. In the cross-project setting, which emphasizes generalizability, we achieve an $F_1$ score of 0.46 compared to 0.29. Finally, in an experimental setting that reduces shared components across projects, we achieve an $F_1$ score of 0.32 compared to 0.19.
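As a reading aid for the reported $F_1$ scores, the sketch below computes a token-level, micro-averaged $F_1$ between predicted and ground-truth function names. The abstract does not define how the metric is computed, so the underscore-based tokenization, the micro-averaging, and all helper names here are assumptions for illustration.

```python
from collections import Counter


def tokenize(name):
    """Split a function name into lowercase tokens on underscores (assumed scheme)."""
    return [t for t in name.lower().split("_") if t]


def token_counts(pred, truth):
    """Return (true positives, predicted tokens, ground-truth tokens) for one name pair."""
    p, t = Counter(tokenize(pred)), Counter(tokenize(truth))
    true_positives = sum((p & t).values())  # multiset intersection
    return true_positives, sum(p.values()), sum(t.values())


def micro_f1(pairs):
    """Micro-averaged token-level F1 over (predicted, ground-truth) name pairs."""
    tp = n_pred = n_true = 0
    for pred, truth in pairs:
        a, b, c = token_counts(pred, truth)
        tp, n_pred, n_true = tp + a, n_pred + b, n_true + c
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A prediction that omits one of three ground-truth tokens:
# precision = 2/2, recall = 2/3, so F1 = 0.8.
score = micro_f1([("read_file", "read_config_file")])
```

Under this definition, a model that predicts fewer but correct tokens keeps precision high at the cost of recall, which is why precision-oriented decoding can still yield a competitive $F_1$.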
