Lyra: A Benchmark for Turducken-Style Code Generation

Qingyuan Liang; Zeyu Sun; Qihao Zhu; Wenjie Zhang; Lian Yu; Yingfei Xiong; Lu Zhang

TL;DR

Lyra introduces a turducken-style code generation task and a 2,000-example Python+SQL dataset with bilingual (Chinese and English) comments to study generating a program in an imperative base language with an embedded declarative language. The work benchmarks Transformer, BERT-style, and GPT-style models and finds that GPT-style models perform best, with AST exact matching accuracy of 24% on Chinese comments and 25.5% on English comments, which underscores the task's difficulty and the room for improvement. By detailing data construction, annotation, and evaluation metrics (BLEU, code executability, and AST-based measures), the paper provides a practical benchmark for real-world software development, where declarative queries are embedded in imperative code. Lyra thus offers a foundation for advancing code generation methods that handle embedded languages and cross-language dependencies.
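To make the AST-based measure concrete, the following is a minimal sketch of how an AST exact match check could be computed for Python programs. It is an illustration under assumptions, not the paper's evaluation script: the function name `ast_exact_match` is hypothetical, and comparing `ast.dump` output is one plausible way to ignore formatting and comments while requiring identical structure.

```python
import ast


def ast_exact_match(predicted: str, reference: str) -> bool:
    """Return True if both programs parse to the same Python AST.

    Hypothetical helper: the exact matching procedure used in the paper
    is not given in this text, so this is only one plausible realization.
    """
    try:
        predicted_tree = ast.parse(predicted)
        reference_tree = ast.parse(reference)
    except SyntaxError:
        # Code that does not parse cannot match any reference.
        return False
    # ast.dump omits line and column attributes by default, so whitespace
    # and comments do not affect the comparison.
    return ast.dump(predicted_tree) == ast.dump(reference_tree)
```

Note that in this sketch an embedded SQL query is just a Python string constant, so two programs match only if their SQL strings are character-for-character identical.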

Abstract

Recently, neural techniques have been used to generate source code automatically. While promising for declarative languages, these approaches achieve much poorer performance on datasets for imperative languages. Since a declarative language is typically embedded in an imperative language (i.e., turducken-style programming) in real-world software development, the promising results on declarative languages can hardly lead to a significant reduction of manual software development effort. In this paper, we define a new code generation task: given a natural language comment, generate a program in a base imperative language with an embedded declarative language. To our knowledge, this is the first turducken-style code generation task. For this task, we present Lyra: a dataset in Python with embedded SQL. This dataset contains 2,000 carefully annotated database manipulation programs from real-world projects. Each program is paired with both a Chinese comment and an English comment. In our experiments, we adopt Transformer, BERT-style, and GPT-style models as baselines. In the best setting, GPT-style models outperform the others, reaching an AST exact matching accuracy of 24% and 25.5% when using Chinese and English comments, respectively. Therefore, we believe that Lyra provides a new challenge for code generation, and that overcoming this challenge may significantly boost the applicability of code generation techniques to real-world software development.
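As an illustration of the task's target output, here is a minimal sketch of a turducken-style program: a Python function whose docstring plays the role of the natural language comment and whose body embeds a SQL query. The example is not taken from the Lyra dataset. The table `users`, its columns, and the functions `get_user_names` and `demo` are hypothetical, and the standard-library `sqlite3` module is used only to keep the sketch self-contained.

```python
import sqlite3


def get_user_names(conn, min_age):
    """Return the names of users whose age is at least min_age, sorted by name."""
    # Declarative part: a SQL query embedded in the Python code as a string.
    sql = "SELECT name FROM users WHERE age >= ? ORDER BY name"
    # Imperative part: run the query and post-process the result rows.
    rows = conn.execute(sql, (min_age,)).fetchall()
    return [row[0] for row in rows]


def demo():
    """Build a small in-memory database (hypothetical data) and query it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
        conn.executemany(
            "INSERT INTO users VALUES (?, ?)",
            [("ann", 31), ("bob", 17), ("cy", 45)],
        )
        return get_user_names(conn, 18)
    finally:
        conn.close()
```

A model solving this kind of task has to produce both layers correctly at once: the SQL must be a valid query for the intended data, and the surrounding Python must execute it and handle the results as the comment describes.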
