Columbo: Low Level End-to-End System Traces through Modular Full-System Simulation

Jakob Görgen, Vaastav Anand, Hejing Li, Jialin Li, Antoine Kaufmann

TL;DR

Columbo addresses the difficulty of understanding performance in heterogeneous cloud systems by leveraging detailed full-system simulations to capture fine-grained hardware-software events and assemble them into end-to-end traces. It integrates modular simulations with distributed tracing by converting simulator logs into type-specific event streams, linking them via pipelines and SpanWeavers, and exporting to existing tracing backends. Its key contributions are an online analysis capability, which builds traces during simulation without persisting large logs, and hardware-enriched traces produced by cross-layer event integration. The clock synchronization case study demonstrates how such traces expose root causes (e.g., switch delays) that application-layer traces miss, underscoring the practical value for debugging and optimizing heterogeneous systems.

Abstract

Fully understanding performance is a growing challenge when building next-generation cloud systems. Often these systems build on next-generation hardware, and evaluation in realistic physical testbeds is out of reach. Even when physical testbeds are available, visibility into essential system aspects is a challenge in modern systems where system performance depends on often sub-μs interactions between HW and SW components. Existing tools such as performance counters, logging, and distributed tracing provide aggregate or sampled information, but remain insufficient for understanding individual requests in-depth. In this paper, we explore a fundamentally different approach to enable in-depth understanding of cloud system behavior at the software and hardware level, with (almost) arbitrarily fine-grained visibility. Our proposal is to run cloud systems in detailed full-system simulations, configure the simulators to collect detailed events without affecting the system, and finally assemble these events into end-to-end system traces that can be analyzed by existing distributed tracing tools.
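
The flow summarized above (simulator logs turned into typed event streams, then linked into spans across components) can be illustrated with a minimal sketch. This is not Columbo's actual implementation or API: the log format (`<timestamp> <event> <packet id>`), the `Event` and `Span` classes, and the `parse_log` and `weave` functions are all hypothetical names chosen for illustration, and export to a tracing backend is omitted. The sketch assumes each simulator emits its log in timestamp order, so the streams can be merged lazily without storing the full logs.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass
class Event:
    ts: int          # simulated time (unit is an assumption, e.g. ns)
    component: str   # which simulator produced the event
    kind: str        # event type taken from the log line
    pkt_id: int      # identifier used to correlate events across components


@dataclass
class Span:
    component: str
    pkt_id: int
    start: int
    end: int
    parent: Optional["Span"] = None


def parse_log(component: str, lines: Iterable[str]) -> Iterator[Event]:
    """Lazily turn hypothetical log lines into typed events."""
    for line in lines:
        ts, kind, pkt_id = line.split()
        yield Event(int(ts), component, kind, int(pkt_id))


def weave(streams: List[Iterator[Event]]) -> List[Span]:
    """Merge per-simulator streams by time and link spans per packet."""
    open_spans: Dict[Tuple[str, int], Span] = {}
    last_span: Dict[int, Span] = {}   # most recent span seen for each packet
    spans: List[Span] = []
    for ev in heapq.merge(*streams, key=lambda e: e.ts):
        key = (ev.component, ev.pkt_id)
        span = open_spans.get(key)
        if span is None:
            # First event for this packet in this component: open a span and
            # make the previous component's span for the packet its parent.
            span = Span(ev.component, ev.pkt_id, ev.ts, ev.ts,
                        parent=last_span.get(ev.pkt_id))
            open_spans[key] = span
            last_span[ev.pkt_id] = span
            spans.append(span)
        else:
            span.end = ev.ts          # extend the span to the latest event
    return spans


# Made-up logs for one packet crossing a host, a NIC, and a switch.
host_log = ["100 tx_start 7", "150 tx_done 7"]
nic_log = ["160 dma_read 7", "220 wire_tx 7"]
switch_log = ["230 enqueue 7", "900 dequeue 7"]

spans = weave([
    parse_log("host", host_log),
    parse_log("nic", nic_log),
    parse_log("switch", switch_log),
])
for s in spans:
    parent = s.parent.component if s.parent else None
    print(s.component, s.end - s.start, parent)
```

In this made-up data the switch span is by far the longest, which mirrors the kind of hardware-level root cause (a switch delay) that an application-layer trace alone would not show.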

Paper Structure

This paper contains 18 sections, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Columbo overview
  • Figure 2: Columbo pipelines overview
  • Figure 3: Columbo evaluation topology
  • Figure 4: Measured difference between the system clocks of the client and server
  • Figure 5: Clock offset between client and server clocks estimated by chrony
  • ...and 1 more figure