Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Zeyu Sun, Jingjing Liang, Weiyi Wang, Chenyao Suo, Junjie Chen, Fanjiang Xu

TL;DR

FLEX introduces a self-adaptive fuzzing framework for MLIR that learns to generate diverse, semantically valid test inputs by coupling neural generation with a feedback loop. Starting from a small seed corpus, FLEX fine-tunes a CodeGen-2B-based generator using LoRA, generates perturbed programs, and augments the training set with diverse valid variants, iterating to reveal crashes. In 30 days, FLEX found 80 previously unknown bugs and, in 24-hour runs, detected 53 bugs with substantially higher code coverage than four strong baselines, supported by ablation studies showing the necessity of perturbation and diversity mechanisms. The results demonstrate that learning-based, self-adaptive fuzzing can markedly improve MLIR robustness and offer insights for applying similar strategies to other compiler infrastructures.

Abstract

MLIR (Multi-Level Intermediate Representation) has rapidly become a foundational technology for modern compiler frameworks, enabling extensibility across diverse domains. However, ensuring the correctness and robustness of MLIR itself remains challenging. Existing fuzzing approaches, based on manually crafted templates or rule-based mutations, struggle to generate sufficiently diverse and semantically valid test cases, making it difficult to expose subtle or deep-seated bugs within MLIR's complex and evolving code space. In this paper, we present FLEX, a novel self-adaptive fuzzing framework for MLIR. FLEX leverages neural networks for program generation, a perturbed sampling strategy to encourage diversity, and a feedback-driven augmentation loop that iteratively improves its model using both crashing and non-crashing test cases. Starting from a limited seed corpus, FLEX progressively learns valid syntax and semantics and autonomously produces high-quality test inputs. We evaluate FLEX on the upstream MLIR compiler against four state-of-the-art fuzzers. In a 30-day campaign, FLEX discovers 80 previously unknown bugs, including multiple new root causes and parser bugs, while in 24-hour fixed-revision comparisons it detects 53 bugs (over 3.5x as many as the best baseline) and achieves 28.2% code coverage, outperforming the next-best tool by 42%. Ablation studies further confirm the critical role of both perturbed generation and diversity augmentation in FLEX's effectiveness.
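The interleaving of learning and exploration described above can be illustrated with a toy sketch. This is not the FLEX implementation: the "model" here is only a token vocabulary standing in for the LoRA-fine-tuned generator, the perturbation is a simple random token swap, and `run_compiler` is a hypothetical oracle standing in for an actual MLIR run. All function names, the token strings, and the crash rule are assumptions made for illustration.

```python
import random


def fine_tune(corpus):
    """Toy stand-in for fine-tuning: the 'model' is the sorted token vocabulary."""
    return sorted({tok for prog in corpus for tok in prog.split()})


def generate(model, corpus, rng, perturb=0.3):
    """Perturbed generation (assumed form): copy a known program and swap each
    token for a random vocabulary token with probability `perturb`."""
    base = rng.choice(corpus).split()
    return " ".join(
        rng.choice(model) if rng.random() < perturb else tok for tok in base
    )


def run_compiler(program):
    """Hypothetical oracle replacing a real compiler run.
    Returns 'crash', 'ok', or 'invalid' using made-up rules."""
    toks = program.split()
    if toks.count("op_c") >= 2:
        return "crash"
    if toks[0] != "func":
        return "invalid"
    return "ok"


def flex_loop(seeds, rounds=5, samples=50, seed=0):
    """Alternate between re-training on the corpus and exploring with it."""
    rng = random.Random(seed)
    corpus = list(seeds)
    crashes = set()
    for _ in range(rounds):
        model = fine_tune(corpus)  # learning step
        for _ in range(samples):  # exploration step
            prog = generate(model, corpus, rng)
            outcome = run_compiler(prog)
            if outcome == "crash":
                crashes.add(prog)
            # Diversity augmentation (assumed form): keep only programs that
            # are new to the corpus and were not rejected as invalid.
            if outcome != "invalid" and prog not in corpus:
                corpus.append(prog)
    return corpus, crashes


seeds = ["func op_a op_b", "func op_c op_a"]
corpus, crashes = flex_loop(seeds)
print(len(corpus), "programs in corpus;", len(crashes), "distinct crashing inputs")
```

The point of the sketch is the control flow: each round re-derives the generator from the augmented corpus, so programs accepted in one round shape what is sampled in the next. How FLEX actually measures diversity and uses crashing inputs during training is not specified in the abstract, so those parts are placeholders.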

Paper Structure

This paper contains 29 sections, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: The overview of FLEX.
  • Figure 2: Illustrative example for different root causes
  • Figure 3: Bug count over time for each method
  • Figure 4: Overlap of bugs found by each method
  • Figure 5: Line coverage (%) over time
  • ...and 1 more figure