Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases

Zian Su, Xiangzhe Xu, Ziyang Huang, Kaiyuan Zhang, Xiangyu Zhang

TL;DR

A novel probe-and-recover framework for binary analysis that combines a binary-source encoder-decoder model with black-box LLMs. It leverages the pre-trained knowledge within SCFMs to synthesize relevant, symbol-rich code fragments, which serve as context that improves the recovery accuracy of the black-box LLMs.

Abstract

Human-Oriented Binary Reverse Engineering (HOBRE) lies at the intersection of binary and source code, aiming to lift binary code to human-readable content relevant to source code, thereby bridging the binary-source semantic gap. Recent advancements in uni-modal code model pre-training, particularly in generative Source Code Foundation Models (SCFMs) and binary understanding models, have laid the groundwork for transfer learning applicable to HOBRE. However, existing approaches for HOBRE rely heavily on uni-modal models like SCFMs for supervised fine-tuning or general LLMs for prompting, resulting in sub-optimal performance. Inspired by recent progress in large multi-modal models, we propose that it is possible to harness the strengths of uni-modal code models from both sides to bridge the semantic gap effectively. In this paper, we introduce a novel probe-and-recover framework that incorporates a binary-source encoder-decoder model and black-box LLMs for binary analysis. Our approach leverages the pre-trained knowledge within SCFMs to synthesize relevant, symbol-rich code fragments as context. This additional context enables black-box LLMs to enhance recovery accuracy. We demonstrate significant improvements in zero-shot binary summarization and binary function name recovery, with a 10.3% relative gain in CHRF and a 16.7% relative gain in a GPT4-based metric for summarization, as well as a 6.7% and 7.4% absolute increase in token-level precision and recall for name recovery, respectively. These results highlight the effectiveness of our approach in automating and improving binary code analysis.
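
The name-recovery numbers above are reported as token-level precision and recall. The following is a minimal sketch of how such a metric is commonly computed, not the paper's evaluation code: the tokenization rule (splitting on underscores and camelCase boundaries) and the helper names `tokenize_name` and `token_precision_recall` are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize_name(name: str) -> list[str]:
    """Split a function name into lowercase tokens.

    Assumption: tokens are delimited by underscores and camelCase
    boundaries; the paper's exact tokenization may differ.
    """
    tokens = []
    for part in re.split(r"_+", name):
        # Acronym runs (HTTP), capitalized or lowercase words, digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [t.lower() for t in tokens]


def token_precision_recall(predicted: str, reference: str) -> tuple[float, float]:
    """Token-level precision and recall of a predicted name against a reference."""
    pred = Counter(tokenize_name(predicted))
    ref = Counter(tokenize_name(reference))
    # Multiset intersection counts each shared token at most as often
    # as it appears in both names.
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical names, chosen only to exercise the metric.
    print(token_precision_recall("read_file_header", "readHeader"))
```

In this hypothetical example the prediction contains every reference token plus one extra (`file`), so recall is 1.0 while precision is 2/3. Scoring at the token level gives partial credit to names that are close but not identical, which exact-match accuracy would not.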

Paper Structure (47 sections, 8 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: The ProRec framework for human-oriented binary reverse engineering. The figure shows a simple example of lifting a cumsum function from binary to a human-readable summary. The probed contexts synthesized by the cross-modal knowledge prober, while not identical to the oracle source code of the query binary, are informative in terms of symbol names and correct loop structure. These contexts help the black-box LLMs recover the high-level functionality of the binary function in a summary that is consistent with the source code summary, moving beyond merely describing its low-level operations.
  • Figure 2: The prober architecture and compute-efficient alignment with limited trainable parameters.
  • Figure 3: Negative log-likelihoods of source functions estimated by the base SCLM, and those conditioned on their binary counterparts estimated by the aligned prober.
  • Figure 4: Scores from our proposed GPT4 evaluator for summaries generated based on GPT3.5-turbo. The x-axes denote context relevance (left) and functionality (right), respectively. Larger scores are better. Bars denote the number of summaries with the corresponding score, and dashed lines denote the number of summaries with at least the corresponding score.
  • Figure 5: Binary function name recovery results with and without LLM's internal analysis by using top-$k$ additional contexts on 100 examples.
  • ...and 8 more figures