ReSIM: Re-ranking Binary Similarity Embeddings to Improve Function Search Performance

Gianluca Capozzi, Anna Paola Giancaspro, Fabio Petroni, Leonardo Querzoni, Giuseppe Antonio Di Luna

TL;DR

ReSIM addresses the limitations of bi-encoder BFS systems by introducing a cross-encoder re-ranker that jointly processes query-candidate pairs in a two-stage function search pipeline. By first retrieving a window of top candidates via efficient embedding-based search and then re-ranking them with a transformer-based cross-encoder, ReSIM achieves significant improvements in $nDCG@k$ and $Recall@k$ across seven embedding models and two datasets. The approach is agnostic to the underlying embedding model and leverages techniques such as LoRA with 4-bit quantization (QLoRA) and hard negative fine-tuning, while also showing benefits from ensembling and pre-training transfer. Empirically, ReSIM yields substantial gains in vulnerability detection and generalizes across unseen toolchains, offering a practical enhancement for security analysis, copyright enforcement, and malware phylogeny tasks.

Abstract

Binary Function Similarity (BFS), the problem of determining whether two binary functions originate from the same source code, has been extensively studied in recent research across security, software engineering, and machine learning communities. This interest arises from its central role in developing vulnerability detection systems, copyright infringement analysis, and malware phylogeny tools. Nearly all binary function similarity systems embed assembly functions into real-valued vectors, where similar functions map to points that lie close to each other in the metric space. These embeddings enable function search: a query function is embedded and compared against a database of candidate embeddings to retrieve the most similar matches. Despite their effectiveness, such systems rely on bi-encoder architectures that embed functions independently, limiting their ability to capture cross-function relationships and similarities. To address this limitation, we introduce ReSIM, a novel and enhanced function search system that complements embedding-based search with a neural re-ranker. Unlike traditional embedding models, our re-ranking module jointly processes query-candidate pairs to compute ranking scores based on their mutual representation, allowing for more accurate similarity assessment. By re-ranking the top results from embedding-based retrieval, ReSIM leverages fine-grained relation information that bi-encoders cannot capture. We evaluate ReSIM across seven embedding models on two benchmark datasets, demonstrating consistent improvements in search effectiveness, with average gains of 21.7% in terms of nDCG and 27.8% in terms of Recall.
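
The two-stage search described above (embedding-based retrieval of a candidate window, followed by pairwise re-ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy embeddings, and the toy re-ranker scores are hypothetical stand-ins, and the window size `w` and cut-off `k` are plain parameters chosen for the example. In ReSIM the embedding function would be a trained BFS bi-encoder and the re-ranker a transformer cross-encoder.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(
    query: str,
    pool: List[str],
    embed: Callable[[str], Vector],
    rerank_score: Callable[[str, str], float],
    w: int,
    k: int,
) -> List[str]:
    """Retrieve the top-w candidates by embedding similarity, then re-rank them."""
    # Stage 1: each function is embedded independently (bi-encoder style).
    q_vec = embed(query)
    by_embedding = sorted(pool, key=lambda f: cosine(q_vec, embed(f)), reverse=True)
    window = by_embedding[:w]
    # Stage 2: every (query, candidate) pair is scored jointly (cross-encoder style).
    reranked = sorted(window, key=lambda f: rerank_score(query, f), reverse=True)
    return reranked[:k]


# Hypothetical toy data: identifiers stand in for binary functions.
toy_embeddings = {"q": (1.0, 0.0), "a": (0.9, 0.1), "b": (0.8, 0.6), "c": (0.0, 1.0)}
toy_pair_scores = {"a": 0.2, "b": 0.9, "c": 1.0}


def toy_embed(func_id: str) -> Vector:
    return toy_embeddings[func_id]


def toy_rerank(query: str, candidate: str) -> float:
    # A real re-ranker would read both functions; here the score is a lookup.
    return toy_pair_scores[candidate]


result = two_stage_search("q", ["a", "b", "c"], toy_embed, toy_rerank, w=2, k=2)
print(result)  # ['b', 'a']
```

In this toy run, `c` never reaches the re-ranker because it falls outside the window of size 2, even though the toy re-ranker would score it highest. The re-ranker can only reorder what the first stage retrieves, so the window size bounds the achievable recall.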

Paper Structure (34 sections, 2 equations, 17 figures, 5 tables)

Figures (17)

  • Figure 1: Embedding-based BFS pipeline. The functions $f_1$ and $f_2$ are processed in isolation by the BFS model $\phi$ to produce their embedding representations $\vec{f_1}$ and $\vec{f_2}$, which are then compared using cosine similarity to get the final score.
  • Figure 2: ReSIM pipeline. (i) The BFS bi-encoder $\phi$ maps the query function $f_q$ to an embedding $\phi(f_q)$. (ii) A similarity measure $sim(\phi(f_q),\phi(f))$ is evaluated against the embeddings of all the functions $f \in P$, and the window set $W$ containing the top-$w$ candidates is retrieved; blue dots denote embeddings of functions semantically similar to the query, whereas red dots denote dissimilar ones. (iii) Each $f\in W$ is paired with the query and scored by the re-ranker cross-encoder $\rho(f_q,f)\in[0,1]$, which reorders $W$ to produce the final top-$k$ list.
  • Figure 3: Re-ranking pipeline with a decoder-only transformer $\rho$ (cross-encoder). The two input functions $f_q$ and $f_i$ are tokenized and concatenated with a [SEP] token to form a single sequence processed by the $N$-layer decoder. A classification head reads the hidden state of the last non-padding token and outputs a logit $y$, which is then used to score and re-rank the candidates in $W$.
  • Figure 4: nDCG for CLAP and BinBERT with DEEP re-ranking ($w=200$, $k \in [1,30]$), evaluated on a pool of 25,000 functions and 5,000 queries from BinCorp.
  • Figure 5: Recall for CLAP and BinBERT with DEEP re-ranking ($w=200$ and $k \in [1,30]$), evaluated on a pool of 25,000 functions and 5,000 queries from BinCorp.
  • ...and 12 more figures
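
Figures 4 and 5 report nDCG and Recall as the cut-off $k$ varies. The paper's exact metric definitions are not part of this excerpt, so the sketch below assumes the common binary-relevance formulation, in which a retrieved function is either similar to the query or not. The function names and the toy ranking are hypothetical.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant functions that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for f in ranked[:k] if f in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    # A relevant item at 0-based position i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2) for i, f in enumerate(ranked[:k]) if f in relevant
    )
    # The ideal ranking places all relevant items (up to k of them) first.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Hypothetical ranking: "a" and "b" are the functions similar to the query.
ranking = ["a", "x", "b", "y"]
similar = {"a", "b"}
print(recall_at_k(ranking, similar, 3))  # 1.0
print(round(ndcg_at_k(ranking, similar, 3), 4))  # 0.9197
```

Recall@k only counts whether similar functions appear in the top $k$, while nDCG@k also rewards placing them earlier, which is why a re-ranker that merely reorders the window can still raise nDCG at a fixed recall.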