Evolutionary Generative Fuzzing for Differential Testing of the Kotlin Compiler
Calin Georgescu, Mitchell Olsthoorn, Pouria Derakhshanfar, Marat Akhin, Annibale Panichella
TL;DR
The paper tackles the oracle problem in compiler testing by applying differential testing (DT) to the Kotlin compilers K1 and K2. It introduces a three-stage generative framework that couples syntactic enrichment (an enriched context-free grammar) with semantic modeling (a semantic context) and black-box search (random sampling and evolutionary fuzzing). It develops two genetic algorithm (GA) variants (single- and many-objective) alongside a random sampling (RS) baseline to produce diverse, valid Kotlin programs and to expose behavioral differences between the two compilers. The empirical study across 50,000 generated files discovers several previously unreported bugs, including out-of-memory errors (OOMs), nested-function overload conflicts, and concurrent modification exceptions; several were confirmed by JetBrains developers and some have been addressed in newer Kotlin releases. Although RS and the GAs find statistically similar numbers of defects, they uncover complementary bug types, which illustrates the value of combining strategies. The work provides replication resources and offers a pathway to apply similar DT-based fuzzing to other languages and compiler frontends.
Abstract
Compiler correctness is a cornerstone of reliable software development. However, systematic testing of compilers is infeasible, given the vast space of possible programs and the complexity of modern programming languages. In this context, differential testing offers a practical methodology as it addresses the oracle problem by comparing the output of alternative compilers given the same set of programs as input. In this paper, we investigate the effectiveness of differential testing in finding bugs within the Kotlin compilers developed at JetBrains. We propose a black-box generative approach that creates input programs for the K1 and K2 compilers. First, we build workable models of Kotlin semantic (semantic interface) and syntactic (enriched context-free grammar) language features, which are subsequently exploited to generate random code snippets. Second, we extend random sampling by introducing two genetic algorithms (GAs) that aim to generate more diverse input programs. Our case study shows that the proposed approach effectively detects bugs in K1 and K2; these bugs have been confirmed, and some of them fixed, by JetBrains developers. While we do not observe a significant difference in the number of defects uncovered by the different search algorithms, random search and GAs are complementary as they find different categories of bugs. Finally, we provide insights into the relationships between the size, complexity, and fault detection capability of the generated input programs.
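The differential-testing oracle described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `CompileResult`, `differential_test`, `stub_old`, and `stub_new` are hypothetical, and the two stub functions are toy stand-ins for K1 and K2 rather than real compiler invocations. The sketch only shows the core idea, namely that a program becomes a bug candidate when the two compilers disagree on it or when either one crashes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class CompileResult:
    """Outcome of one compiler run (hypothetical, simplified model)."""
    accepted: bool         # True if the compiler accepted the program
    crashed: bool = False  # True on an internal failure, e.g. an OOM


# A "compiler" here is any function from source text to a CompileResult.
Compiler = Callable[[str], CompileResult]


def differential_test(
    programs: Iterable[str],
    compiler_a: Compiler,
    compiler_b: Compiler,
) -> List[Tuple[str, CompileResult, CompileResult]]:
    """Return every program on which the two compilers disagree or crash."""
    discrepancies = []
    for source in programs:
        a = compiler_a(source)
        b = compiler_b(source)
        # No ground truth is needed: disagreement itself is the oracle.
        if a.crashed or b.crashed or a.accepted != b.accepted:
            discrepancies.append((source, a, b))
    return discrepancies


# Toy stand-ins (assumptions for illustration, NOT the real K1/K2 behavior).
def stub_old(source: str) -> CompileResult:
    return CompileResult(accepted=True)


def stub_new(source: str) -> CompileResult:
    # Rejects a program that declares `fun f(` more than once.
    return CompileResult(accepted=source.count("fun f(") < 2)


programs = [
    "fun f() {}",
    "fun g() { fun f() {}; fun f() {} }",
]
found = differential_test(programs, stub_old, stub_new)
```

In this toy run only the second program is flagged, because the two stubs disagree on it. In a real setup each stub would be replaced by a call that runs the corresponding compiler on the generated file and records its verdict; a flagged program is only a candidate and still needs manual triage to decide which compiler, if either, is at fault.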
