Describe Data to get Science-Data-Ready Tooling: Awkward as a Target for Kaitai Struct YAML
Manasvi Goyal, Andrea Zonca, Amy Roberts, Jim Pivarski, Ianna Osborne
TL;DR
The paper tackles the challenge that many small-to-mid-scale scientific collaborations face in reading and analyzing custom binary data formats. It proposes describing formats using Kaitai Struct YAML (KSY) and generating cross-language parsers with the Kaitai Struct Compiler, then converting parsed data into Awkward Arrays via the Awkward Runtime, leveraging a C++ LayoutBuilder. The main contributions include the Awkward Target for Kaitai Struct, the AwkwardRuntime API, and a concrete workflow demonstrated on example KSYs (e.g., animalData), enabling immediate Python analysis without reimplementing tooling. This approach can significantly reduce maintenance burden and lower barriers to data analysis across disciplines that rely on legacy or bespoke data formats.
Abstract
In some fields, scientific data formats differ across experiments because of specialized hardware and data acquisition systems. Researchers must develop, document, and maintain experiment-specific analysis software to work with these formats, and this software is often tightly coupled to one particular format. The resulting proliferation of custom data formats has been a prominent challenge for small to mid-scale experiments. The widespread adoption of ROOT has largely mitigated this problem for the Large Hadron Collider experiments, but many smaller experiments continue to use custom data formats to meet specific research needs. Simplifying access to a unique data format for analysis therefore holds immense value for scientific communities within high-energy physics (HEP). For this purpose, we have added Awkward Arrays as a target language for Kaitai Struct. Researchers describe their custom data format in the Kaitai Struct YAML (KSY) language. From that KSY description, the Kaitai Struct Compiler generates C++ code that fills the LayoutBuilder buffers. In a few steps, the Kaitai Struct Awkward Runtime API converts the generated C++ code into a compiled Python module. Finally, the raw data can be passed to the module to produce Awkward Arrays. This paper introduces the Awkward Target for the Kaitai Struct Compiler and the Kaitai Struct Awkward Runtime API. It also demonstrates the conversion of a given KSY for a specific custom file format to Awkward Arrays.
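To make the starting point concrete, the following is a minimal sketch of the kind of hand-written, format-specific reader that a declarative KSY description is meant to replace. It is not the paper's generated code and does not use Kaitai Struct or the Awkward Runtime. The record layout, the field names (name, age, weight), and the helper functions are all hypothetical, chosen only for illustration; they are not taken from the animalData example.

```python
import struct

# Hypothetical binary record layout (an assumption for illustration,
# NOT the paper's animalData format):
#   1 byte   name_len  (unsigned)
#   N bytes  name      (ASCII, N = name_len)
#   1 byte   age       (unsigned)
#   8 bytes  weight    (little-endian float64)
TAIL = "<Bd"  # age + weight; "<" means little-endian with no padding


def parse_animals(buf: bytes) -> list:
    """Parse back-to-back records into a list of dicts, one per record."""
    records = []
    offset = 0
    while offset < len(buf):
        (name_len,) = struct.unpack_from("<B", buf, offset)
        offset += 1
        name = buf[offset:offset + name_len].decode("ascii")
        offset += name_len
        age, weight = struct.unpack_from(TAIL, buf, offset)
        offset += struct.calcsize(TAIL)
        records.append({"name": name, "age": age, "weight": weight})
    return records


def encode_animal(name: str, age: int, weight: float) -> bytes:
    """Hypothetical writer, used here only to produce sample bytes."""
    raw = name.encode("ascii")
    return struct.pack("<B", len(raw)) + raw + struct.pack(TAIL, age, weight)


sample = encode_animal("cat", 3, 4.5) + encode_animal("dog", 7, 20.25)
animals = parse_animals(sample)
print(animals)
```

Every change to such a layout means editing the parser by hand, which is the maintenance burden the abstract describes. If Awkward Array is installed, a list of records like `animals` can be turned into an array with `ak.from_iter(animals)`; the workflow in this paper instead generates the parser from the KSY description and fills LayoutBuilder buffers directly in C++.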
