XMainframe: A Large Language Model for Mainframe Modernization

Anh T. V. Dau; Hieu Trung Dao; Anh Tuan Nguyen; Hieu Trung Tran; Phong X. Nguyen; Nghi D. Q. Bui

TL;DR

XMainframe is a domain-specialized LLM for mainframe systems and COBOL code, built on DeepSeek-Coder and supported by a tailored data pipeline and a dedicated benchmark, MainframeBench. The authors use a two-stage training process (pretraining followed by instruction tuning) and apply depthwise upscaling to reach 10.5B parameters while preserving efficiency. They construct large, high-quality pretraining and instruction datasets, including 236 million tokens of pretraining data and the 53,351-entry Mainframe-Instruct corpus, using data augmentation and seed-based prompting. On MainframeBench, XMainframe consistently outperforms public baselines across multiple-choice questions (MCQ), question answering (QA), and COBOL summarization, indicating strong potential for supporting mainframe modernization and legacy system maintenance. The work highlights the importance of domain-focused data curation and benchmarks for robust legacy-domain AI tools, with practical impact in finance, government, and other mainframe-driven sectors.
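The summary above mentions depthwise upscaling without describing the procedure. As a rough illustration only, the sketch below shows one common form of the idea: duplicate a layer stack, trim the overlapping ends of the two copies, and concatenate them into a deeper stack. The function name `depth_upscale`, the 32-layer base, and the overlap of 8 are hypothetical values chosen for the example and are not taken from the paper; the authors' actual recipe may differ.

```python
def depth_upscale(layers, overlap):
    """Build a deeper stack from two copies of `layers`.

    The first copy drops its last `overlap` layers, the second copy drops
    its first `overlap` layers, and the two trimmed copies are concatenated.
    This is an illustrative assumption, not the paper's confirmed method.
    """
    if not 0 <= overlap <= len(layers):
        raise ValueError("overlap must be between 0 and the number of layers")
    first_copy = layers[: len(layers) - overlap]   # keep the early layers
    second_copy = layers[overlap:]                 # keep the late layers
    return first_copy + second_copy


# Hypothetical base model: 32 layers, represented here by name only.
base = [f"layer_{i}" for i in range(32)]
scaled = depth_upscale(base, overlap=8)
print(len(scaled))  # 48 (24 layers from each trimmed copy)
```

In a real model the list elements would be transformer blocks with copied weights, and the upscaled model would then be trained further so the joined halves adapt to each other.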

Abstract

Mainframe operating systems, despite their inception in the 1940s, continue to support critical sectors like finance and government. However, these systems are often viewed as outdated, requiring extensive maintenance and modernization. Addressing this challenge necessitates innovative tools that can understand and interact with legacy codebases. To this end, we introduce XMainframe, a state-of-the-art large language model (LLM) specifically designed with knowledge of mainframe legacy systems and COBOL codebases. Our solution involves the creation of an extensive data collection pipeline to produce high-quality training datasets, enhancing XMainframe's performance in this specialized domain. Additionally, we present MainframeBench, a comprehensive benchmark for assessing mainframe knowledge, including multiple-choice questions, question answering, and COBOL code summarization. Our empirical evaluations demonstrate that XMainframe consistently outperforms existing state-of-the-art LLMs across these tasks. Specifically, XMainframe achieves 30% higher accuracy than DeepSeek-Coder on multiple-choice questions, doubles the BLEU score of Mixtral-Instruct 8x7B on question answering, and scores six times higher than GPT-3.5 on COBOL summarization. Our work highlights the potential of XMainframe to drive significant advancements in managing and modernizing legacy systems, thereby enhancing productivity and saving time for software developers.

Paper Structure (20 sections, 6 figures, 4 tables)

This paper contains 20 sections, 6 figures, 4 tables.

Introduction
Related Work
Code Large Language Models
LLMs for Domain-Specific Tasks
Benchmark for COBOL and Mainframe Systems
Data Construction
Dataset for Pretraining
Dataset for Model Instruct
Overview of XMainframe
Pretrained Model
Training Details
Model Upscale
Experiments
Experimental Settings
Metrics
...and 5 more sections

Figures (6)

Figure 1: Data Augmentation Pipeline.
Figure 2: Examples for Multiple Choice Question task.
Figure 3: Examples for Question Answering task.
Figure 4: Examples for COBOL summarization task.
Figure 5: Overview of training process.
...and 1 more figures
