Improving MPI Error Detection and Repair with Large Language Models and Bug References

Scott Piersall, Yang Gao, Shenyang Liu, Liqiang Wang

Abstract

Message Passing Interface (MPI) is a foundational technology in high-performance computing (HPC), widely used for large-scale simulations and distributed training (e.g., in machine learning frameworks such as PyTorch and TensorFlow). However, maintaining MPI programs remains challenging due to the complex interplay among processes and the intricacies of message passing and synchronization. With the advancement of large language models (LLMs) like ChatGPT, it is tempting to adopt such technology for automated error detection and repair. Yet, our studies reveal that directly applying LLMs yields suboptimal results, largely because these models lack essential knowledge about correct and incorrect usage, particularly the bugs found in MPI programs. In this paper, we design a bug detection and repair technique that combines Few-Shot Learning (FSL), Chain-of-Thought (CoT) reasoning, and Retrieval-Augmented Generation (RAG) to enhance an LLM's ability to detect and repair errors. These enhancements lead to a significant improvement in error detection accuracy, from 44% to 77%, compared to baseline methods that use ChatGPT directly. Additionally, our experiments demonstrate that our bug-referencing technique generalizes well to other large language models.
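The pipeline summarized above can be sketched as follows: retrieve the bug references most similar to a target MPI snippet (the RAG step), then assemble a Few-Shot prompt around them with a Chain-of-Thought instruction. This is a minimal illustrative sketch; the similarity scoring, corpus entries, and prompt wording are assumptions for demonstration, not the paper's actual retrieval corpus, embedding model, or prompts.

```python
def similarity(code: str, reference: str) -> float:
    """Toy lexical similarity: Jaccard overlap of whitespace tokens.
    A stand-in for the embedding-based retrieval a real RAG system would use."""
    a, b = set(code.split()), set(reference.split())
    return len(a & b) / max(len(a | b), 1)

def retrieve_bug_references(code: str, corpus: list[dict], k: int = 2) -> list[dict]:
    """RAG step: rank stored bug references by similarity to the input code."""
    return sorted(corpus, key=lambda r: similarity(code, r["snippet"]), reverse=True)[:k]

def build_prompt(code: str, corpus: list[dict]) -> str:
    """Assemble a Few-Shot + CoT prompt augmented with retrieved bug references."""
    parts = ["You are an expert MPI programmer. Analyze the program for bugs."]
    # Few-Shot: each retrieved reference pairs buggy code with its diagnosis.
    for ex in retrieve_bug_references(code, corpus):
        parts.append(f"Example buggy code:\n{ex['snippet']}\nDiagnosis: {ex['diagnosis']}")
    # CoT: ask the model to reason before answering.
    parts.append("Think step by step about process interplay, message matching, "
                 "and synchronization before giving your answer.")
    parts.append(f"Program to analyze:\n{code}")
    return "\n\n".join(parts)

# Hypothetical bug-reference corpus covering two classic MPI error patterns.
BUG_CORPUS = [
    {"snippet": "MPI_Send(buf, n, MPI_INT, peer, 0, MPI_COMM_WORLD); "
                "MPI_Recv(buf, n, MPI_INT, peer, 0, MPI_COMM_WORLD, &st);",
     "diagnosis": "Both ranks call blocking MPI_Send first: potential deadlock."},
    {"snippet": "MPI_Isend(buf, n, MPI_INT, peer, 0, MPI_COMM_WORLD, &req);",
     "diagnosis": "MPI_Request never completed with MPI_Wait: resource leak."},
]
```

The retrieved examples ground the model in concrete correct-versus-incorrect MPI usage, which is precisely the knowledge the abstract notes is missing when an LLM is prompted directly.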

Paper Structure

This paper contains 37 sections, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: Comparison of Zero-Shot, Few-Shot, Few-Shot+Chain-of-Thought (CoT), and Few-Shot+CoT+RAG prompting techniques. The inclusion of Few-Shot and CoT reasoning significantly enhances performance across all metrics.
  • Figure 2: Detailed Performance Metrics Across Experimental ChatGPT Trials: Comparison of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) across Zero-Shot, Few-Shot, CoT, and RAG experimental setups, providing an in-depth insight into the strengths and weaknesses of each method in identifying MPI program defects.
  • Figure 3: Comparison among different RAG configurations. RAG_100% (blue bar) achieves the best performance.
  • Figure 4: Distribution of Repair Successes and Failures by Evaluation Metric, summarizing ChatGPT repair success rates across three criteria: Successful Compilation, Resource Leak Removal, and Deadlock Removal, and depicting the specific areas of strength and opportunities for future enhancement in MPI program repair.
  • Figure 5: Comparison of Zero-Shot, Few-Shot, Few-Shot+CoT, and Few-Shot+CoT+RAG True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) results. The inclusion of our bug-referencing Few-Shot+CoT reasoning exhibits the largest improvement in true positives and reduction in false negatives across all three LLMs.
  • ...and 3 more figures