FaaSter Troubleshooting -- Evaluating Distributed Tracing Approaches for Serverless Applications
Maria C. Borges, Sebastian Werner, Ahmet Kilic
TL;DR
Serverless applications present observability challenges due to multi-service fault propagation and limited platform instrumentation. The authors propose a fault observability model based on three facets of evidence—visibility, ambiguity, and inconsistency—and instantiate it for AWS Lambda and OpenWhisk. They implement and compare two tracing approaches—developer-driven tracing and platform-supported tracing—in OpenWhisk, and evaluate their effect on fault observability, latency, and resource usage. Results show that distributed tracing improves fault observability, with trade-offs in backend tooling and platform overhead, offering actionable guidance for developers and providers.
Abstract
Serverless applications can be particularly difficult to troubleshoot, as these applications are often composed of various managed and partly managed services. Faults are often unpredictable and can occur at multiple points, even in simple compositions. Each additional function or service in a serverless composition introduces a new possible fault source and a new layer that can obfuscate faults. Currently, serverless platforms offer only limited support for identifying runtime faults. Developers looking to observe their serverless compositions often have to rely on scattered logs and ambiguous error messages to pinpoint root causes. In this paper, we investigate the use of distributed tracing for improving the observability of faults in serverless applications. To this end, we first introduce a model for characterizing fault observability, then provide prototypical implementations of two tracing approaches: developer-driven tracing and platform-supported tracing. We compare both approaches using our model, measure the associated trade-offs (execution latency, resource utilization), and contribute new insights for troubleshooting serverless compositions.
