Asm2SrcEval: Evaluating Large Language Models for Assembly-to-Source Code Translation

Parisa Hamedi, Hamed Jelodar, Samita Bai, Mohammad Meymani, Roozbeh Razavi-Far, Ali A. Ghorbani

TL;DR

This work addresses the challenge of translating assembly code to high-level C++ by introducing the first benchmark that evaluates five instruction-tuned LLMs across lexical, semantic, fluency, and efficiency dimensions. It systematically compares small and large models on a SBAN-derived dataset with manually verified C++ references and reports the trade-offs between accuracy and inference speed. Key findings show that larger models yield higher fidelity while smaller models offer practical deployment benefits, so no single model is the best choice and the authors advocate hybrid strategies. The results provide a foundation for future work on robust, efficient program translation applicable to reverse engineering and security analysis.

Abstract

Assembly-to-source code translation is a critical task in reverse engineering, cybersecurity, and software maintenance, yet systematic benchmarks for evaluating large language models on this problem remain scarce. In this work, we present the first comprehensive evaluation of five state-of-the-art large language models on assembly-to-source translation. We assess model performance using a diverse set of metrics capturing lexical similarity (BLEU, ROUGE, and METEOR), semantic alignment (BERTScore), fluency (perplexity), and efficiency (prediction time). Our results reveal clear trade-offs: while certain models excel in text similarity metrics, others demonstrate lower perplexity or faster inference times. We further provide qualitative analyses of typical model successes and failure cases, highlighting challenges such as control flow recovery and identifier reconstruction. Taken together, our benchmark offers actionable insights into the strengths and limitations of current large language models for program translation, establishing a foundation for future research on combining accuracy with efficiency for real-world applications.

Paper Structure

This paper contains 10 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: The methodology workflow of our proposed approach.
  • Figure 4: Spider plot comparing models across metrics that are aligned with performance (higher values indicate better results): BLEU, METEOR, ROUGE-L F1, and BERTScore F1. Each axis is min–max normalized across models to enable direct comparison. The plot highlights relative strengths and weaknesses of the models, with overlapping traces indicating identical performance on these metrics.
  • Figure 5: Bubble chart showing the efficiency trade-off across models. X-axis: average prediction time (s). Y-axis: average perplexity (lower is better for both). Bubble area $\propto$ parameter count (billions), color encodes the model. Models closer to the lower-left are more efficient; the legend groups small ($\leq$4B) vs large ($>$4B) models.
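
The per-axis min–max normalization mentioned in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the model names, and all scores are hypothetical placeholders, and mapping tied scores to 0.5 is an assumption made here to avoid division by zero.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one metric's per-model scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Every model ties on this metric. Returning 0.5 is an
        # assumption of this sketch; the paper does not state how
        # ties are handled.
        return {model: 0.5 for model in scores}
    return {model: (value - lo) / (hi - lo) for model, value in scores.items()}


# Hypothetical scores for illustration only, NOT results from the paper.
raw = {
    "BLEU": {"model_a": 10.0, "model_b": 30.0, "model_c": 20.0},
    "BERTScore F1": {"model_a": 80.0, "model_b": 90.0, "model_c": 80.0},
}

# Normalize each metric (each spider-plot axis) independently across models.
normalized = {
    metric: min_max_normalize(per_model) for metric, per_model in raw.items()
}

for metric, per_model in normalized.items():
    print(metric, per_model)
```

Because each axis is rescaled independently, the best model on a metric always lands at 1 and the worst at 0, so the plot shows relative ranking per metric rather than absolute score differences.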