ArkEval: Benchmarking and Evaluating Automated Code Repair for ArkTS
Bang Xie, Senjian Zhang, Zhiyuan Peng, Wei Chen, Chenhao Ying, Yuan Luo
TL;DR
ArkEval tackles the lack of automated repair benchmarks for ArkTS, a strict, UI-centric extension of TypeScript used in HarmonyOS. It builds a 502-issue, executable ArkTS benchmark from 149 Huawei ArkTS apps and introduces a novel LLM-Vote Test Oracle Synthesis pipeline to generate and validate regression tests where none existed. The framework delivers a RAG-enabled repair workflow, encapsulated in the ArkFix agent, and evaluates four LLMs on repository-level repair, revealing substantial gaps between current capabilities and practical ArkTS repair needs. The results underscore the challenge of domain-specific, high-assurance repair in low-resource languages and point to data-dense, domain-tailored approaches and IDE-integrated tooling as essential future directions.
Abstract
Large language models (LLMs) have transformed code generation, enabling unprecedented automation in software development. As mobile ecosystems evolve, HarmonyOS has emerged as a critical platform requiring robust development tools. Software development for the HarmonyOS ecosystem relies heavily on ArkTS, a statically typed extension of TypeScript. Despite its growing importance, the ecosystem lacks robust tools for automated code repair, primarily due to the absence of a high-quality benchmark for evaluation. To address this gap, we present ArkEval, a unified framework for constructing benchmarks and evaluating automated repair workflows for ArkTS. It provides the first comprehensive benchmark specifically designed for ArkTS automated program repair. We constructed this benchmark by mining issues from a large-scale official Huawei repository containing over 400 independent ArkTS applications. Through a rigorous multi-stage filtering process, we curated 502 reproducible issues. To ensure testability, we employed a novel LLM-based test generation and voting mechanism involving Claude and other models. Furthermore, we standardized problem statements to facilitate fair evaluation. Finally, we evaluated four state-of-the-art LLMs on our benchmark using a retrieval-augmented repair workflow. Our results highlight the current capabilities and limitations of LLMs in repairing ArkTS code, paving the way for future research in this low-resource language domain.
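The abstract does not spell out the voting rule used to validate generated tests. As a minimal sketch, assuming a simple strict-majority vote among several judge models, the idea might look like the following. All names here (`CandidateTest`, `Judge`, `acceptByMajority`, `selectOracles`) are hypothetical and not taken from ArkEval; the judges are stubbed as plain functions where a real pipeline would call an LLM.

```typescript
// Hypothetical types for illustration only; not ArkEval's actual API.
type Verdict = "accept" | "reject";

interface CandidateTest {
  id: string;
  source: string; // generated regression test code
}

// A judge maps a candidate test to a verdict.
// Assumption: in a real pipeline this would wrap a call to an LLM.
type Judge = (test: CandidateTest) => Verdict;

// Accept a candidate only if a strict majority of judges accept it.
// Ties and an empty judge panel are treated as rejection.
function acceptByMajority(test: CandidateTest, judges: Judge[]): boolean {
  if (judges.length === 0) return false;
  const accepts = judges.filter((judge) => judge(test) === "accept").length;
  return accepts * 2 > judges.length;
}

// Keep only the candidates that pass the vote.
function selectOracles(candidates: CandidateTest[], judges: Judge[]): CandidateTest[] {
  return candidates.filter((test) => acceptByMajority(test, judges));
}

// Usage with stub judges standing in for three models.
const lenientJudge: Judge = () => "accept";
const strictJudge: Judge = (test) => (test.source.includes("expect") ? "accept" : "reject");
const panel: Judge[] = [lenientJudge, strictJudge, strictJudge];

const candidates: CandidateTest[] = [
  { id: "t1", source: "expect(render()).toBeDefined()" },
  { id: "t2", source: "render()" }, // no assertion, so the strict judges reject it
];

const kept = selectOracles(candidates, panel).map((test) => test.id);
```

A strict majority is only one possible aggregation rule; unanimity or weighted votes would trade recall of valid tests against the risk of admitting a flawed oracle.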
