Revisiting Quantum Code Generation: Where Should Domain Knowledge Live?

Oscar Novo, Oscar Bastidas-Jossa, Alberto Calvo, Antonio Peris, Carlos Kuchkovsky

Abstract

Recent advances in large language models (LLMs) have enabled the automation of an increasing number of programming tasks, including code generation for scientific and engineering domains. In rapidly evolving software ecosystems such as quantum software development, where frameworks expose complex abstractions, a central question is how best to incorporate domain knowledge into LLM-based assistants while preserving maintainability as libraries evolve. In this work, we study specialization strategies for Qiskit code generation using the Qiskit-HumanEval benchmark. We compare a parameter-specialized fine-tuned baseline introduced in prior work against a range of recent general-purpose LLMs enhanced with retrieval-augmented generation (RAG) and agent-based inference with execution feedback. Our results show that modern general-purpose LLMs consistently outperform the parameter-specialized baseline. While the fine-tuned model achieves approximately 47% pass@1 on Qiskit-HumanEval, recent general-purpose models reach 60-65% under zero-shot and retrieval-augmented settings, and up to 85% for the strongest evaluated model when combined with iterative execution-feedback agents. This is an improvement of more than 20 percentage points over zero-shot general-purpose performance and more than 35 percentage points over the parameter-specialized baseline. Agentic execution feedback yields the most consistent improvements, albeit at increased runtime cost, while RAG provides modest and model-dependent gains. These findings indicate that performance gains can be achieved without domain-specific fine-tuning by relying instead on inference-time augmentation, enabling a more flexible and maintainable approach to LLM-assisted quantum software development.

Paper Structure (35 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Simplified RAG pipeline. A query is embedded and used to retrieve relevant chunks from corpora built over Qiskit documentation and/or source code (using dense FAISS or sparse BM25 retrieval), optionally reranked, and incorporated into the prompt for code generation.
  • Figure 2: Agent-based inference with execution feedback on the Qiskit-HumanEval benchmark.
  • Figure 3: RAG Retrieval: Configuration space of retrieval/scoring strategies evaluated, grouped by corpus indexing scheme.
  • Figure 4: General accuracy (top) and execution time (bottom) under different specialization strategies. We compare a training-time parameter-specialized baseline model (Param-Spec.; fine-tuned Granite reported by Dupuis et al., 2024) against general-purpose LLMs evaluated with inference-time system-level specialization: zero-shot, retrieval-augmented generation (RAG), and a single-step generate-execute-repair loop (Agent).
  • Figure 5: General accuracy (top) and execution time (bottom) across model families under multi-step agentic inference with up to five repair attempts. Results correspond to iterative generate-execute-repair loops, where models attempt to correct failed executions using unit-test feedback until success or the maximum number of repair attempts is reached. Execution time is measured over the full evaluation set of 151 benchmark tasks.
  • ...and 1 more figure
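
The generate-execute-repair loop described in the captions for Figures 4 and 5 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `run_candidate`, `repair_loop`, and `toy_generate` are hypothetical, the "model" is a stub standing in for an LLM call, and the in-process `exec` is an assumption made for brevity (a real harness would presumably run candidates in an isolated sandbox). The sketch only shows the control flow: generate a candidate, run it against the unit test, and feed the error text back until the test passes or the repair budget is exhausted.

```python
import traceback
from typing import Callable, Optional, Tuple

# Hypothetical generator signature: (task prompt, feedback or None) -> candidate code.
Generator = Callable[[str, Optional[str]], str]


def run_candidate(code: str, test: str) -> Optional[str]:
    """Run the candidate and then its unit test; return None on success, else error text.

    Assumption: plain exec() in a fresh namespace. This is NOT a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(test, namespace)
        return None
    except Exception:
        return traceback.format_exc(limit=1)


def repair_loop(
    generate: Generator, prompt: str, test: str, max_repairs: int = 5
) -> Tuple[str, int, bool]:
    """Return (last candidate, number of repairs used, whether the test passed)."""
    feedback: Optional[str] = None
    code = ""
    for attempt in range(max_repairs + 1):  # one initial attempt plus the repairs
        code = generate(prompt, feedback)
        error = run_candidate(code, test)
        if error is None:
            return code, attempt, True
        # Feed the failing code and its error back into the next generation step.
        feedback = f"Previous attempt:\n{code}\nError:\n{error}"
    return code, max_repairs, False


def toy_generate(prompt: str, feedback: Optional[str]) -> str:
    """Stub model: wrong on the first try, correct once it receives feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


code, repairs, passed = repair_loop(
    toy_generate, "Write add(a, b).", "assert add(2, 3) == 5"
)
print(passed, repairs)
```

Setting `max_repairs=1` corresponds to the single-step loop named in the Figure 4 caption, while the default of five matches the repair budget stated for Figure 5. Each repair adds another generation and execution, which is consistent with the increased runtime cost the abstract reports for agentic inference.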