TransLibEval: Demystify Large Language Models' Capability in Third-party Library-targeted Code Translation
Pengyu Xue, Kunwu Zheng, Zhen Yang, Yifei Pei, Linhao Wu, Jiahui Dong, Xiapu Luo, Yan Xiao, Fei Liu, Yuxuan Zhang, Xiran Lyu, Xianhang Li, Xuanyu Zhu, Chengyi Wang
TL;DR
This work introduces TransLibEval, the first benchmark focused on library-centric code translation to rigorously evaluate LLM performance when Third-Party Libraries (TPLs) are involved. By providing 200 parallel tasks across Python, Java, and C++ with extensive library coverage and six translation strategies (Direct, IR-guided variants, and Retrieval-Augmented variants), the study reveals a substantial performance gap relative to library-free benchmarks and highlights the critical role of API/library awareness. A large-scale, expert-driven analysis of 4,831 GPT-4o failures uncovers a dominance of third-party reference errors and reveals strategy-specific failure modes, offering practical guidance for building more TPL-aware code intelligence. The results show that commercial LLMs generally outperform smaller models, Python translations are comparatively easier, and retrieval or IR-guided approaches can mitigate some library-related challenges, informing future directions for robust cross-language API mapping and library adaptation.
Abstract
In recent years, Large Language Models (LLMs) have been widely studied for code translation at the method, class, and even repository levels. However, most existing benchmarks are limited in the categories and scale of Third-Party Libraries (TPLs) they cover, making TPL-related errors hard to expose and hindering the development of targeted solutions. Given the heavy reliance on TPLs in practical programming (over 90%), demystifying and analyzing LLMs' code translation performance in the presence of various TPLs becomes imperative. To address this gap, we construct TransLibEval, the first benchmark dedicated to library-centric code translation. It consists of 200 real-world tasks across Python, Java, and C++, each explicitly involving TPLs from diverse categories such as data processing, machine learning, and web development, with comprehensive dependency coverage and high-coverage test suites. We evaluate seven recent LLMs from commercial, general, and code-specialized families under six translation strategies in three categories: Direct, IR-guided, and Retrieval-augmented. Experimental results show a dramatic performance drop compared with library-free settings (an average decline of over 60% in Computational Accuracy, CA), while the different strategies demonstrate heterogeneous advantages. Furthermore, we analyze 4,831 failed cases from GPT-4o, one of the State-of-the-Art (SOTA) LLMs, revealing numerous third-party reference errors that were previously obscured. These findings highlight the unique challenges of library-centric translation and provide practical guidance for improving TPL-aware code intelligence.
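To make the CA figure concrete, here is a minimal sketch of how such a score could be computed. It assumes that CA stands for Computational Accuracy and is measured as the fraction of tasks whose translated code passes every test case, which is the usual definition in code translation work; the abstract itself does not spell this out. The function name `computational_accuracy`, the task IDs, and the outcomes are all hypothetical and are not data from the paper.

```python
from typing import Dict, List


def computational_accuracy(results: Dict[str, List[bool]]) -> float:
    """Return the fraction of tasks whose translation passes all of its tests.

    Assumption: a task with no recorded test outcomes (for example, the
    translation failed to compile) counts as a failure.
    """
    if not results:
        return 0.0
    passed = sum(1 for outcomes in results.values() if outcomes and all(outcomes))
    return passed / len(results)


# Hypothetical per-task test outcomes, for illustration only.
example = {
    "task_001": [True, True, True],   # all tests pass
    "task_002": [True, False, True],  # one test fails
    "task_003": [],                   # nothing ran, counted as a failure
    "task_004": [True, True],         # all tests pass
}

print(computational_accuracy(example))  # 0.5
```

Under this all-or-nothing scoring, a single wrong library call fails the whole task, so errors in TPL usage lower the score directly.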
