Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

Le Chen, Nuo Xu, Winson Chen, Bin Lei, Pei-Hung Lin, Dunzhi Zhou, Rajeev Thakur, Caiwen Ding, Ali Jannesari, Chunhua Liao

TL;DR

The paper tackles the difficulty of translating code in low-resource domains by introducing a dual-LLM dialogue-based data generation pipeline that integrates compiler and runtime feedback. By collecting multi-turn dialogues and intermediate reasoning (Questioner–Solver) along with verified translations, the authors fine-tune open-weight models to substantially improve compilation, execution, and unit-test success on Fortran→C++ and C++→CUDA. They demonstrate that dialogue-centered supervision outperforms traditional code-pair training, achieve strong results with midsize open models that rival proprietary systems, and release large-scale, diverse datasets with multiple supervision formats. This work proposes a general, scalable paradigm for domain-specific code translation that can accelerate HPC modernization and reduce reliance on large closed models.

Abstract

Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran -> C++ and C++ -> CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show this data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.

Paper Structure

This paper contains 31 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: High-level architecture of the Questioner–Solver module, which replaces the single LLM core of conventional LLM agent frameworks. The Questioner analyzes state and formulates queries using dialogue memory and external tools (e.g., compilers, runtime environments, scripts), while the Solver generates translations, unit tests, and repairs. Their iterative interaction enables reasoning separation, external knowledge integration, and progressive refinement of translations.
  • Figure 2: Multi-Turn Dialogue Dataset Generation Pipeline.
  • Figure 3: Impact of Fine-tuning and Debug Rounds on Fortran to C++ Translation Success (Qwen2.5-Coder-7B).
  • Figure 4: Impact of Fine-tuning and Debug Rounds on C++ to CUDA Translation Success (CodeLlama-13B).
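
The Questioner–Solver interaction described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`refine`, `DialogueResult`, the `stub_*` functions) is hypothetical, the two LLMs are replaced by hard-coded stand-ins, and the compiler is replaced by a simple string check. It only shows the shape of the loop: the Questioner turns tool feedback into a query, the Solver proposes a translation, and each turn is recorded so the dialogue itself can serve as training data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper).
Tool = Callable[[str], Tuple[bool, str]]        # candidate code -> (ok, feedback log)
Solver = Callable[[str, str], str]              # (source code, query) -> candidate code
Questioner = Callable[[str, List[dict]], str]   # (last feedback, memory) -> next query


@dataclass
class DialogueResult:
    translation: str
    verified: bool
    turns: List[dict] = field(default_factory=list)


def refine(source: str, questioner: Questioner, solver: Solver,
           tool: Tool, max_rounds: int = 3) -> DialogueResult:
    """Run the query/translate/check loop and keep every turn as dialogue memory."""
    memory: List[dict] = []
    feedback = ""
    candidate = ""
    for _ in range(max_rounds):
        query = questioner(feedback, memory)   # Questioner formulates the next request
        candidate = solver(source, query)      # Solver produces or repairs a translation
        ok, feedback = tool(candidate)         # external check (compiler stand-in)
        memory.append({"query": query, "answer": candidate,
                       "feedback": feedback, "ok": ok})
        if ok:
            return DialogueResult(candidate, True, memory)
    return DialogueResult(candidate, False, memory)


# --- Stand-ins so the sketch runs without an LLM or a compiler ---------------

def stub_tool(code: str) -> Tuple[bool, str]:
    # Pretend compiler: rejects code that uses std::cout without the header.
    if "#include <iostream>" not in code:
        return False, "error: 'cout' is not a member of 'std'"
    return True, ""


def stub_questioner(feedback: str, memory: List[dict]) -> str:
    if not feedback:
        return "Translate the Fortran program to C++."
    return f"The compiler reported: {feedback} Please repair the translation."


def stub_solver(source: str, query: str) -> str:
    body = 'int main() { std::cout << "hi\\n"; return 0; }'
    # The first attempt omits the header; the repair attempt adds it.
    if "repair" in query:
        return "#include <iostream>\n" + body
    return body


result = refine("print *, 'hi'", stub_questioner, stub_solver, stub_tool)
for turn in result.turns:
    print(turn["ok"], "|", turn["query"])
```

In this toy run the first candidate fails the check and the second passes, so `result.turns` holds a two-turn dialogue ending in a verified translation. In a real pipeline the stand-ins would be replaced by model calls and by actual compilation, execution, and unit-test runs.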