A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, Pen-Chung Yew, Weihua Zhang
TL;DR
The paper addresses the challenge of applying learning-based dynamic binary translation (DBT) to system-level emulation, where CPU-state maintenance, address translation, and interrupt handling introduce substantial overhead. It develops a prototype on QEMU that learns one-step translations and then optimizes the system-level workflow in three ways: coordination overhead reduction, coordination elimination, and instruction scheduling. The key contributions are the targeted optimizations (lazy parsing, inter-TB and memory-operation optimizations, and define-before-use and interrupt-driven scheduling) and empirical results showing average speedups over QEMU 6.1 of 1.36× on SPEC CINT2006 and 1.15× on real-world applications, along with substantial reductions in CPU-state coordination overhead. The work demonstrates that combining learning-based translation with system-level emulation is both feasible and practically beneficial, enabling more efficient design, debugging, and evaluation of guest OSes on heterogeneous platforms.
Abstract
System-level emulators have been used extensively for system design, debugging, and evaluation. They provide a system-level virtual machine that supports a guest operating system (OS) running on a host platform whose native OS and instruction-set architecture (ISA) may be the same as or different from the guest's. Dynamic binary translation (DBT) is one of the core technologies for such system-level emulation. A recently proposed learning-based DBT approach has shown significantly improved performance and higher-quality translated code by using automatically learned translation rules. However, it has been applied only to user-level emulation, and not yet to system-level emulation. In this paper, we explore the feasibility of applying this approach to improve system-level emulation, and use QEMU to build a prototype. ... To achieve better performance, we leverage several optimizations: coordination overhead reduction, which lowers the cost of each coordination, and coordination elimination and code scheduling, which lower the coordination frequency. Experimental results show that the prototype achieves an average 1.36× speedup over QEMU 6.1, with negligible coordination overhead, in system emulation mode using SPEC CINT2006 as application benchmarks, and 1.15× on real-world applications.
