ArkEval: Benchmarking and Evaluating Automated Code Repair for ArkTS

Bang Xie, Senjian Zhang, Zhiyuan Peng, Wei Chen, Chenhao Ying, Yuan Luo

TL;DR

ArkEval tackles the lack of automated repair benchmarks for ArkTS, a strict, UI-centric extension of TypeScript used in HarmonyOS. It builds a 502-issue, executable ArkTS benchmark from 149 Huawei ArkTS apps and introduces a novel LLM-Vote Test Oracle Synthesis pipeline to generate and validate regression tests where none existed. The framework delivers a RAG-enabled repair workflow, encapsulated in the ArkFix agent, and evaluates four LLMs on repository-level repair, revealing substantial gaps between current capabilities and practical ArkTS repair needs. The results underscore the challenge of domain-specific, high-assurance repair in low-resource languages and point to data-dense, domain-tailored approaches and IDE-integrated tooling as essential future directions.

Abstract

Large language models have transformed code generation, enabling unprecedented automation in software development. As mobile ecosystems evolve, HarmonyOS has emerged as a critical platform requiring robust development tools. Software development for the HarmonyOS ecosystem relies heavily on ArkTS, a statically typed extension of TypeScript. Despite its growing importance, the ecosystem lacks robust tools for automated code repair, primarily due to the absence of a high-quality benchmark for evaluation. To address this gap, we present ArkEval, a unified framework for ArkTS automated repair workflow evaluation and benchmark construction. It provides the first comprehensive benchmark specifically designed for ArkTS automated program repair. We constructed this benchmark by mining issues from a large-scale official Huawei repository containing over 400 independent ArkTS applications. Through a rigorous multi-stage filtering process, we curated 502 reproducible issues. To ensure testability, we employed a novel LLM-based test generation and voting mechanism involving Claude and other models. Furthermore, we standardized problem statements to facilitate fair evaluation. Finally, we evaluated four state-of-the-art Large Language Models (LLMs) on our benchmark using a retrieval-augmented repair workflow. Our results highlight the current capabilities and limitations of LLMs in repairing ArkTS code, paving the way for future research in this low-resource language domain.
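The voting step mentioned above can be pictured with a small sketch. The abstract does not state the voting rule, so the strict-majority threshold, the `Verdict` and `CandidateTest` types, and the function names below are all assumptions for illustration, not the authors' implementation.

```typescript
// Hypothetical verdict that one reviewer model returns for a candidate test.
type Verdict = "accept" | "reject";

// Hypothetical shape: one generated test plus the committee's verdicts on it.
interface CandidateTest {
  id: string;
  verdicts: Verdict[];
}

// Assumed rule: keep a test only if strictly more than half of the reviewers accept it.
function passesVote(test: CandidateTest): boolean {
  const accepts = test.verdicts.filter((v) => v === "accept").length;
  return accepts * 2 > test.verdicts.length;
}

// Filter the generated candidates down to the tests that survive the vote.
function selectOracles(candidates: CandidateTest[]): CandidateTest[] {
  return candidates.filter(passesVote);
}

const candidates: CandidateTest[] = [
  { id: "t1", verdicts: ["accept", "accept", "reject"] },
  { id: "t2", verdicts: ["reject", "accept", "reject"] },
];

const kept: string[] = selectOracles(candidates).map((t) => t.id);
console.log(kept);
```

With three reviewers, a candidate needs at least two accepting verdicts to be kept under this assumed rule; a stricter unanimous rule would be an equally plausible design.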

Paper Structure (50 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: The "False Friend" Trap: (a) Idiomatic TypeScript code that LLMs frequently generate is often (b) strictly illegal in ArkTS due to AOT compilation constraints.
  • Figure 2: Overview of the ArkEval benchmark for ArkTS code generation.
  • Figure 3: Overview of the ArkEval Benchmark Construction Process. The workflow proceeds through four phases: Repository Mining (identifying 149 apps), Defect Curation (filtering to 502 issues), Test Oracle Synthesis (using a multi-agent committee), and Problem Standardization.
  • Figure 4: Overview of an ArkEval Benchmark Instance. Each sample comprises: ① the buggy function signature, ② the standardized requirement/issue description, ③ the project repository structure, ④ retrieved context from official documentation, ⑤ the generated fix, and ⑥ the reproduction test case.
  • Figure 5: The ArkTS Automated Repair Framework. The pipeline proceeds from semantic fault localization (Left) to RAG-augmented patch generation (Center) and finally isolated execution and verification (Right).
  • ...and 1 more figure
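The six components listed in the Figure 4 caption suggest a record type along the following lines. This is a minimal sketch: the interface name, field names, field types, and the placeholder values are assumptions chosen for illustration, and the benchmark's actual schema may differ.

```typescript
// Hypothetical record for one benchmark instance, following the Figure 4 caption.
interface ArkEvalInstance {
  buggyFunctionSignature: string; // (1) the buggy function signature
  issueDescription: string;       // (2) standardized requirement/issue description
  repositoryStructure: string[];  // (3) project repository structure, as relative paths (assumed)
  retrievedContext: string[];     // (4) snippets retrieved from official documentation
  generatedFix: string | null;    // (5) the generated fix, null until a model produces one (assumed)
  reproductionTest: string;       // (6) the reproduction test case
}

// An instance can be executed and verified only once a fix has been generated.
function isReadyForVerification(instance: ArkEvalInstance): boolean {
  return instance.generatedFix !== null && instance.reproductionTest.length > 0;
}

// Placeholder values only; none of these come from the benchmark itself.
const sample: ArkEvalInstance = {
  buggyFunctionSignature: "function example(): void",
  issueDescription: "placeholder issue description",
  repositoryStructure: ["placeholder/path.ets"],
  retrievedContext: ["placeholder documentation snippet"],
  generatedFix: null,
  reproductionTest: "placeholder test body",
};

const before: boolean = isReadyForVerification(sample);
const after: boolean = isReadyForVerification({ ...sample, generatedFix: "placeholder patch" });
console.log(before, after);
```

Keeping the fix nullable separates the inputs given to a repair model (components 1 to 4) from its output (5) and from the oracle used to judge it (6).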