Bootstrapping Fuzzers for Compilers of Low-Resource Language Dialects Using Language Models

Sairam Vaidya, Marcel Böhme, Loris D'Antoni

TL;DR

Germinator tackles the challenge of fuzzing low-resource MLIR dialects by bootstrapping seeds from automatically extracted dialect grammars and refining them with grammar-constrained language-model sampling, all inside a coverage-guided fuzzing loop. The approach achieves dialect-agnostic seed generation and dialect-effective exploration, improving line coverage by up to 120% and discovering 88 previously unknown bugs across six projects. The work demonstrates how grammar extraction from TableGen and LM-based seed generation can scale testing across a heterogeneous dialect ecosystem without per-dialect engineering. It provides a practical blueprint for automating seed generation for compiler testing in extensible language ecosystems like MLIR and beyond.

Abstract

Modern extensible compiler frameworks, such as MLIR, enable rapid creation of domain-specific language dialects. This flexibility, however, makes correctness harder to ensure, as the same extensibility that accelerates development also complicates maintaining the testing infrastructure. Extensible languages require automated test generation that is both dialect-agnostic (works across dialects without manual adaptation) and dialect-effective (targets dialect-specific features to find bugs). Existing approaches typically sacrifice one of these goals by either requiring manually constructed seed corpora for each dialect, or by failing to be effective. We present a dialect-agnostic and dialect-effective grammar-based and coverage-guided fuzzing approach for extensible compilers that combines two key insights from existing work: (i) the grammars of dialects, which already encode the structural and type constraints, can often be extracted automatically from the dialect specification; and (ii) these grammars can be used in combination with pre-trained large language models to automatically generate representative and diverse seed inputs from the full dialect space without requiring any manual input or training data. These seeds can then be used to bootstrap coverage-guided fuzzers. We built this approach into a tool, Germinator. When evaluated on six MLIR projects spanning 91 dialects, Germinator-generated seeds improve line coverage by 10-120% over grammar-based baselines. We compare against grammar-based baselines because they are the only class of existing automatic seed generators that can be applied uniformly across MLIR's heterogeneous dialect ecosystem. Germinator discovers 88 previously unknown bugs (40 confirmed), including 23 in dialects with no prior automated test generators, demonstrating effective and controllable testing of low-resource dialects at scale.
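
The idea of combining a grammar with a language model, as described in the abstract, can be illustrated with a minimal sketch. This is not Germinator's implementation: the grammar below is a hypothetical toy (a finite-state grammar over a handful of made-up tokens, not one extracted from TableGen), and `toy_lm_logits` is a stand-in that returns arbitrary scores where a real system would query a pre-trained model. The sketch only shows the masking step: at each position, tokens the grammar does not allow are discarded before sampling, so every output is syntactically valid by construction.

```python
import math
import random

# Hypothetical toy grammar (an assumption for illustration only):
# state -> {allowed token -> next state}. It accepts zero or more
# statements of the form "%v = add|mul %v , %v ;".
GRAMMAR = {
    "start": {"%v": "lhs", "<eos>": "done"},
    "lhs": {"=": "eq"},
    "eq": {"add": "op", "mul": "op"},
    "op": {"%v": "arg1"},
    "arg1": {",": "comma"},
    "comma": {"%v": "arg2"},
    "arg2": {";": "start"},
}

# The vocabulary also holds tokens the grammar never allows,
# mimicking a model that can propose invalid continuations.
VOCAB = sorted({t for edges in GRAMMAR.values() for t in edges} | {"print", "return"})


def toy_lm_logits(prefix, rng):
    """Stand-in for a language model: one arbitrary score per vocabulary token."""
    return {tok: rng.uniform(-1.0, 1.0) for tok in VOCAB}


def sample_constrained(rng, max_statements=3):
    """Sample a token sequence, masking out tokens the grammar forbids."""
    state, tokens, statements = "start", [], 0
    while state != "done":
        allowed = set(GRAMMAR[state])
        if state == "start" and statements >= max_statements:
            allowed = {"<eos>"}  # force termination at a statement boundary
        logits = toy_lm_logits(tokens, rng)
        candidates = sorted(allowed)  # grammar mask: only allowed tokens survive
        weights = [math.exp(logits[t]) for t in candidates]  # unnormalized softmax
        tok = rng.choices(candidates, weights=weights, k=1)[0]
        if tok == ";":
            statements += 1
        state = GRAMMAR[state][tok]
        if tok != "<eos>":
            tokens.append(tok)
    return tokens


def accepts(tokens):
    """Check that a token sequence is a complete sentence of the toy grammar."""
    state = "start"
    for tok in tokens:
        if tok not in GRAMMAR.get(state, {}):
            return False
        state = GRAMMAR[state][tok]
    return state == "start"


if __name__ == "__main__":
    print(" ".join(sample_constrained(random.Random(0))))
```

Because the mask is applied before sampling, the model's scores only influence which valid continuation is chosen; they never decide whether the output parses.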

Paper Structure

This paper contains 71 sections, 6 equations, 8 figures, 6 tables, 2 algorithms.

Figures (8)

  • Figure 1: A valid emitc.for loop missing its induction variable. The verifier assumes this variable always exists and crashes when trying to access it.
  • Figure 2: Grammar extraction for the EmitC dialect. (a) Snippet of TableGen operation definitions: directly extracted assembly formats are highlighted in blue and custom C++ formats requiring inference are highlighted in magenta. (b) Grammar rules extracted from TableGen enforce syntactic constraints, while semantic constraints, such as operand compatibility, are handled implicitly by the language model.
  • Figure 3: Grammar-constrained sampling ensures syntactic validity. Given the same prompt (the specific prompt is not one used by Germinator and is for illustrative purposes), (a) unconstrained LLM generation produces multiple syntax errors (highlighted in red): incorrect SSA value declarations (!i32 3 = ...), invalid loop syntax (in [%3, %4]), non-existent operations (emitc.print), and invalid module-level return. (b) Grammar-constrained sampling (highlighted in green) follows the extracted grammar rules exactly, producing syntactically valid code that respects the EmitC dialect's constraints.
  • Figure 4: Generic format.
  • Figure 5: Cumulative bug discovery over 24 hours across all six target projects. Germinator discovers substantially more bugs and finds them earlier than baselines. In Torch-MLIR and Triton, baselines discover zero bugs while Germinator finds 7 and 3 bugs respectively. Shaded regions indicate 95% bootstrapped confidence intervals across 5 trials.
  • ...and 3 more figures
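
The Figure 5 caption reports 95% bootstrapped confidence intervals across 5 trials. As a rough sketch of how such an interval can be computed, the block below uses the percentile bootstrap on the mean. The caption does not say which bootstrap variant the authors used, so the percentile method is an assumption, and the function name `bootstrap_ci` and the sample values in the usage line are purely hypothetical, not results from the paper.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the means.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical per-trial bug counts, made up for illustration only.
    trials = [3.0, 5.0, 4.0, 6.0, 2.0]
    print(bootstrap_ci(trials))
```

With only 5 trials the interval is coarse, since every resampled mean is built from the same five values.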

Theorems & Definitions (5)

  • Example 3.1: EmitC composes with core MLIR
  • Definition 3.1: Coverage-Weighted Target Distribution
  • Definition 3.2: Dialect-Agnostic and Dialect-Effective Fuzzing Strategy
  • Definition 3.3: Good Seed Generator
  • Definition 4.1: Constrained-Sampling Seed Generator