ArkTS-CodeSearch: An Open-Source ArkTS Dataset for Code Retrieval

Yulong He, Artem Ermakov, Sergey Kovalchuk, Artem Aliev, Dmitry Shalymov

TL;DR

This work addresses the lack of public resources for ArkTS code intelligence by constructing ArkTS-CodeSearch, a large-scale dataset from GitHub and Gitee for code retrieval. It adopts a CodeSearchNet–style single-search task, uses tree-sitter-arkts to produce accurate docstring–function pairs, and benchmarks open-source embeddings while demonstrating substantial gains through ArkTS-specific contrastive fine-tuning. The study provides a standardized benchmark for ArkTS code retrieval and releases both the dataset and a fine-tuned embedding model to foster reproducibility and future research in the OpenHarmony ecosystem. The results underscore the importance of domain- and language-aware supervision, with mid-sized, language-aligned models achieving competitive performance after fine-tuning.

Abstract

ArkTS is a core programming language in the OpenHarmony ecosystem, yet research on ArkTS code intelligence is hindered by the lack of public datasets and evaluation benchmarks. This paper presents a large-scale ArkTS dataset constructed from open-source repositories, targeting code retrieval and code evaluation tasks. We design a single-search task, where natural language comments are used to retrieve corresponding ArkTS functions. ArkTS repositories are crawled from GitHub and Gitee, and comment-function pairs are extracted using tree-sitter-arkts, followed by cross-platform deduplication and statistical analysis of ArkTS function types. We further evaluate existing open-source code embedding models on the single-search task and perform fine-tuning using both ArkTS and TypeScript training datasets, resulting in a high-performing model for ArkTS code understanding. This work establishes the first systematic benchmark for ArkTS code retrieval. Both the dataset and our fine-tuned model are publicly available at https://huggingface.co/datasets/hreyulog/arkts-code-docstring and https://huggingface.co/hreyulog/embedinggemma_arkts, respectively.
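
The cross-platform deduplication step mentioned in the abstract could be sketched roughly as follows. This is a minimal sketch, not the authors' procedure: it assumes that duplicates are detected by exact match on whitespace-normalized function text (hashed with SHA-256), and the record keys `platform`, `docstring`, and `code` are hypothetical names chosen for illustration.

```python
import hashlib
import re


def normalize(code: str) -> str:
    """Collapse runs of whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", " ", code).strip()


def dedup_pairs(pairs):
    """Keep the first occurrence of each function body across platforms.

    `pairs` is an iterable of dicts with the hypothetical keys
    'platform', 'docstring', and 'code'.
    """
    seen = set()
    unique = []
    for pair in pairs:
        key = hashlib.sha256(normalize(pair["code"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


# Toy records: the same function mirrored on both platforms with different formatting.
pairs = [
    {"platform": "github", "docstring": "Add two numbers.",
     "code": "function add(a: number, b: number): number {\n  return a + b\n}"},
    {"platform": "gitee", "docstring": "Add two numbers.",
     "code": "function add(a: number, b: number): number { return a + b }"},
    {"platform": "gitee", "docstring": "Subtract b from a.",
     "code": "function sub(a: number, b: number): number { return a - b }"},
]
unique = dedup_pairs(pairs)
```

Exact hashing only catches mirrored copies that differ in whitespace; near-duplicate detection (for example, on token sets) would be a stricter alternative, and the source does not say which criterion was used.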

Paper Structure (28 sections, 4 equations, 7 figures, 2 tables)

This paper contains 28 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Workflow for creating ArkTS-CodeSearch.
  • Figure 2: Overview of the CodeSearchNet-style retrieval framework. Docstrings and function code are encoded by a shared encoder into a unified embedding space. The model is trained with contrastive learning and retrieves functions based on cosine similarity.
  • Figure 3: Distribution of source
  • Figure 4: Distribution of AST Length (# of nodes)
  • Figure 5: Distribution of Docstring Length (characters)
  • ...and 2 more figures
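
The retrieval setup described in the Figure 2 caption (a shared encoder, contrastive training, cosine-similarity ranking) can be illustrated with a small sketch. It rests on stated assumptions: the toy 2-D vectors stand in for encoder outputs, the loss is a generic in-batch InfoNCE objective rather than the paper's confirmed formulation, and the temperature value and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_vec, code_vecs):
    """Return indices of candidate functions, most similar first."""
    scores = [cosine(query_vec, c) for c in code_vecs]
    return sorted(range(len(code_vecs)), key=lambda i: scores[i], reverse=True)


def info_nce(doc_vecs, code_vecs, temperature=0.05):
    """In-batch contrastive loss: docstring i should match function i;
    every other function in the batch acts as a negative."""
    total = 0.0
    for i, d in enumerate(doc_vecs):
        logits = [cosine(d, c) / temperature for c in code_vecs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(doc_vecs)


# Toy embeddings standing in for the shared encoder's output (assumption).
doc_vecs = [[1.0, 0.1], [0.1, 1.0]]
code_vecs = [[0.9, 0.2], [0.2, 0.9]]

ranking = retrieve(doc_vecs[0], code_vecs)
aligned_loss = info_nce(doc_vecs, code_vecs)
shuffled_loss = info_nce(doc_vecs, code_vecs[::-1])
```

With correctly paired embeddings the loss is close to zero, while reversing the function order (so each docstring is paired with the wrong function) makes it much larger; fine-tuning pushes the encoder toward the former situation.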