XMainframe: A Large Language Model for Mainframe Modernization

Anh T. V. Dau, Hieu Trung Dao, Anh Tuan Nguyen, Hieu Trung Tran, Phong X. Nguyen, Nghi D. Q. Bui

TL;DR

XMainframe is a domain-specialized LLM for mainframe systems and COBOL code, built atop DeepSeek-Coder and supported by a tailored data pipeline and a dedicated benchmark, MainframeBench. The authors use a two-stage training process (pretraining followed by instruction tuning) and apply depthwise upscaling to reach $10.5$B parameters while preserving efficiency. They construct large, high-quality pretraining and instruction-tuning datasets, including a 236-million-token pretraining corpus and the 53,351-entry Mainframe-Instruct corpus, using data augmentation and seed-based prompting. Empirical results on MainframeBench show that XMainframe consistently outperforms public baselines on multiple-choice questions (MCQ), question answering (QA), and COBOL summarization, which suggests strong potential for supporting mainframe modernization and legacy system maintenance. The work highlights the importance of domain-focused data curation and benchmarks for building robust AI tools in legacy domains, with practical impact in finance, government, and other mainframe-driven sectors.
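The depthwise upscaling mentioned above can be illustrated with a minimal sketch. It assumes the recipe popularized by SOLAR 10.7B: duplicate the base model's layer stack, drop the last few layers of the first copy and the first few layers of the second, and concatenate the rest. Whether XMainframe follows exactly this recipe, and with which layer counts, is not stated in this summary; the function name `depth_upscale`, the `overlap` parameter, and the 8-layer toy stack are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def depth_upscale(layers: Sequence[T], overlap: int) -> List[T]:
    """Build a deeper stack from two copies of `layers` (illustrative only).

    The first copy drops its last `overlap` layers, the second copy drops its
    first `overlap` layers, and the two pieces are concatenated. With real
    neural-network modules, the second copy would need to be deep-copied so
    the duplicated layers can be trained independently.
    """
    n = len(layers)
    if not 0 <= overlap <= n:
        raise ValueError("overlap must be between 0 and len(layers)")
    lower = list(layers[: n - overlap])  # first copy without its top layers
    upper = list(layers[overlap:])       # second copy without its bottom layers
    return lower + upper


# Toy stack: strings stand in for transformer blocks (hypothetical sizes).
base = [f"block_{i}" for i in range(8)]
deeper = depth_upscale(base, overlap=2)
print(len(deeper))  # 12 layers: 6 from the first copy, 6 from the second
```

The resulting stack has `2 * (n - overlap)` layers, so the parameter count grows with depth while each layer keeps its original width; the upscaled model is then trained further to recover from the seam between the two copies.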

Abstract

Mainframe operating systems, despite their inception in the 1940s, continue to support critical sectors like finance and government. However, these systems are often viewed as outdated, requiring extensive maintenance and modernization. Addressing this challenge necessitates innovative tools that can understand and interact with legacy codebases. To this end, we introduce XMainframe, a state-of-the-art large language model (LLM) specifically designed with knowledge of mainframe legacy systems and COBOL codebases. Our solution involves the creation of an extensive data collection pipeline to produce high-quality training datasets, enhancing XMainframe's performance in this specialized domain. Additionally, we present MainframeBench, a comprehensive benchmark for assessing mainframe knowledge, including multiple-choice questions, question answering, and COBOL code summarization. Our empirical evaluations demonstrate that XMainframe consistently outperforms existing state-of-the-art LLMs across these tasks. Specifically, XMainframe achieves 30% higher accuracy than DeepSeek-Coder on multiple-choice questions, doubles the BLEU score of Mixtral-Instruct 8x7B on question answering, and scores six times higher than GPT-3.5 on COBOL summarization. Our work highlights the potential of XMainframe to drive significant advancements in managing and modernizing legacy systems, thereby enhancing productivity and saving time for software developers.

Paper Structure

This paper contains 20 sections, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Data Augmentation Pipeline.
  • Figure 2: Examples for Multiple Choice Question task.
  • Figure 3: Examples for Question Answering task.
  • Figure 4: Examples for COBOL summarization task.
  • Figure 5: Overview of training process.
  • ...and 1 more figure