Towards Observation Lakehouses: Living, Interactive Archives of Software Behavior
Marcus Kessel
TL;DR
The paper addresses the scarcity of ground-truth run-time behavior in the data used to train and evaluate code-generating LLMs. It proposes the Observation Lakehouse, a scalable lakehouse architecture for continual Stimulus-Response Cubes (SRCs) that unifies data from controlled experiments and CI pipelines. It introduces an append-only ingestion model based on invocation step records and a three-table Iceberg-Parquet-DuckDB design that supports on-the-fly reconstruction of Stimulus-Response Matrices (SRMs), behavioral clustering, and consensus oracles without re-execution. Feasibility is demonstrated on a benchmark of roughly 8.6M observation rows processed on a laptop, with sub-second query latencies and high ingestion throughput, and the implementation is released as open source to foster behavior-aware evaluation and training. The work lays a practical foundation for large-scale mining of software execution behavior and outlines infrastructure paths for evolution, federation, and advanced analytics in software engineering data ecosystems.
Abstract
Code-generating LLMs are trained largely on static artifacts (source, comments, specifications) and rarely on materializations of run-time behavior. As a result, they readily internalize buggy or mislabeled code. Since non-trivial semantic properties are undecidable in general, the only practical way to obtain ground-truth functionality is by dynamic observation of executions. In prior work, we addressed representation with Sequence Sheets, Stimulus-Response Matrices (SRMs), and Stimulus-Response Cubes (SRCs) to capture and compare behavior across tests, implementations, and contexts. These structures make observation data analyzable offline and reusable, but they do not by themselves provide persistence, evolution, or interactive analytics at scale. In this paper, we therefore introduce observation lakehouses that operationalize continual SRCs: a tall, append-only observations table that stores every actuation (stimulus, response, context), together with SQL queries that materialize SRC slices on demand. Built on Apache Parquet + Iceberg + DuckDB, the lakehouse ingests data from controlled pipelines (LASSO) and CI pipelines (e.g., unit test executions), enabling n-version assessment, behavioral clustering, and consensus oracles without re-execution. On a 509-problem benchmark, we ingest ≈8.6M observation rows (<51 MiB) and reconstruct SRM/SRC views and clusters in <100 ms on a laptop, demonstrating that continual behavior mining is practical without a distributed cluster. This makes behavioral ground truth first-class alongside other run-time data and provides an infrastructure path toward behavior-aware evaluation and training. The Observation Lakehouse, together with the accompanying dataset, is publicly available as an open-source project on GitHub: https://github.com/SoftwareObservatorium/observation-lakehouse
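To make the idea of a tall, append-only observations table concrete, the following is a minimal sketch in plain Python rather than the Parquet + Iceberg + DuckDB stack described above. The `Observation` schema, the function names, and the toy `abs` data are all hypothetical and chosen for illustration only; the pivot, clustering, and majority vote are simple stand-ins for the SQL queries, behavioral clustering, and consensus oracles mentioned in the abstract, not the project's actual implementation.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Observation(NamedTuple):
    """One row of a tall observations table (hypothetical schema)."""
    problem: str         # benchmark problem the stimulus belongs to
    test: str            # stimulus: identifier of the test / invocation
    implementation: str  # implementation that was executed
    response: str        # serialized observed output


# Append-only log: rows are only ever added, never updated in place.
observations: list[Observation] = []


def ingest(rows):
    """Append new observation rows to the log."""
    observations.extend(rows)


def materialize_srm(rows, problem):
    """Pivot tall rows into an SRM: test -> {implementation -> response}."""
    srm = defaultdict(dict)
    for row in rows:
        if row.problem == problem:
            srm[row.test][row.implementation] = row.response
    return dict(srm)


def cluster_by_behavior(srm):
    """Group implementations whose responses agree on every test."""
    tests = sorted(srm)
    impls = sorted({impl for responses in srm.values() for impl in responses})
    clusters = defaultdict(list)
    for impl in impls:
        signature = tuple(srm[test].get(impl) for test in tests)
        clusters[signature].append(impl)
    return sorted(clusters.values())


def consensus_oracle(srm):
    """Pick the majority response per test (ties are not handled specially)."""
    return {
        test: Counter(responses.values()).most_common(1)[0][0]
        for test, responses in srm.items()
    }


# Toy data: three implementations of abs(), one of them faulty on negatives.
ingest([
    Observation("abs", "t1: abs(-2)", "impl_a", "2"),
    Observation("abs", "t1: abs(-2)", "impl_b", "2"),
    Observation("abs", "t1: abs(-2)", "impl_c", "-2"),
    Observation("abs", "t2: abs(3)", "impl_a", "3"),
    Observation("abs", "t2: abs(3)", "impl_b", "3"),
    Observation("abs", "t2: abs(3)", "impl_c", "3"),
])

srm = materialize_srm(observations, "abs")
clusters = cluster_by_behavior(srm)
oracle = consensus_oracle(srm)
print(clusters)
print(oracle)
```

In this sketch, no implementation is re-executed to answer a new question: the SRM, the clusters, and the oracle are all derived from rows that were already recorded. New observations can be appended at any time, and the same views can be recomputed from the longer log.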
