Columbo: Low Level End-to-End System Traces through Modular Full-System Simulation
Jakob Görgen, Vaastav Anand, Hejing Li, Jialin Li, Antoine Kaufmann
TL;DR
Columbo addresses the difficulty of understanding performance in heterogeneous cloud systems by using detailed full-system simulations to capture fine-grained hardware and software events and assemble them into end-to-end traces. It integrates modular simulations with distributed tracing by converting simulator logs into type-specific event streams, linking them via pipelines and SpanWeavers, and exporting the result to existing tracing backends. Two key contributions are online analysis, which builds traces during simulation without persisting large logs, and hardware-enriched traces produced by integrating events across layers. The clock synchronization case study demonstrates how such traces expose root causes (e.g., switch delays) that application-layer traces miss, underscoring their practical value for debugging and optimizing heterogeneous systems.
Abstract
Fully understanding performance is a growing challenge when building next-generation cloud systems. Often these systems build on next-generation hardware, and evaluation in realistic physical testbeds is out of reach. Even when physical testbeds are available, visibility into essential system aspects is a challenge in modern systems, where performance often depends on sub-$\mu$s interactions between HW and SW components. Existing tools such as performance counters, logging, and distributed tracing provide aggregate or sampled information, but remain insufficient for understanding individual requests in depth. In this paper, we explore a fundamentally different approach to enable in-depth understanding of cloud system behavior at the software and hardware level, with (almost) arbitrarily fine-grained visibility. Our proposal is to run cloud systems in detailed full-system simulations, configure the simulators to collect detailed events without affecting the system, and finally assemble these events into end-to-end system traces that can be analyzed by existing distributed tracing tools.
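To make the final step of the proposal more concrete, the following is a minimal sketch of how per-simulator log lines might be parsed into typed events and grouped into per-component spans of one end-to-end trace. It is not Columbo's implementation: the log format, the `Event` and `Span` types, the `parse` and `weave` functions, and the shared `req=` identifier used to link components are all assumptions made for illustration. The parser is a generator, so events are consumed as a stream rather than stored first, loosely mirroring the idea of building traces online.

```python
import re
from dataclasses import dataclass, field

# Hypothetical log format (an assumption, not a real simulator's output):
#   "<timestamp> <component> <event_name> req=<request_id>"
LINE_RE = re.compile(r"^(\d+) (\w+) (\w+) req=(\w+)$")


@dataclass
class Event:
    ts: int
    component: str
    name: str
    req: str


@dataclass
class Span:
    component: str
    req: str
    start: int
    end: int
    events: list = field(default_factory=list)


def parse(lines):
    """Yield typed events from raw log lines, skipping lines that do not match."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            yield Event(int(m.group(1)), m.group(2), m.group(3), m.group(4))


def weave(events):
    """Group events into one span per (request, component), then collect the
    spans of each request into a trace ordered by span start time."""
    spans = {}
    for ev in events:
        key = (ev.req, ev.component)
        span = spans.get(key)
        if span is None:
            span = spans[key] = Span(ev.component, ev.req, ev.ts, ev.ts)
        span.start = min(span.start, ev.ts)
        span.end = max(span.end, ev.ts)
        span.events.append(ev.name)
    traces = {}
    for span in spans.values():
        traces.setdefault(span.req, []).append(span)
    for trace in traces.values():
        trace.sort(key=lambda s: s.start)
    return traces


# Made-up example logs from three simulated components.
HOST_LOG = ["1000 host syscall_send req=r1", "1200 host driver_tx req=r1"]
NIC_LOG = ["1500 nic dma_read req=r1", "1900 nic eth_tx req=r1"]
SWITCH_LOG = ["2400 switch enqueue req=r1", "9400 switch dequeue req=r1"]

traces = weave(parse(HOST_LOG + NIC_LOG + SWITCH_LOG))
for span in traces["r1"]:
    print(span.component, span.end - span.start)
```

With the made-up logs above, the switch span is by far the longest of the three, which is the kind of hardware-level delay that a trace limited to the application would not show. A real system would additionally have to propagate trace context across component boundaries instead of relying on a request identifier that already appears in every log line.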
