A New Approach for Evaluating the Performance of Distributed Latency-Sensitive Services

Theodoros Theodoropoulos, John Violos, Antonios Makris, Konstantinos Tserpes

TL;DR

This paper addresses the inadequacy of traditional latency metrics for modern distributed, latency-sensitive services by focusing on two SLA-relevant aspects: how often latency exceeds thresholds and how quickly latency recovers. It introduces five fault-tolerance-based latency metrics, M1–M5, that reframe SLA violations as faults and derive timing and reliability-like measures from Time(No Violations($t$)) and Time(Violations($t$)). The approach is validated through a large-scale CloudSim Plus experiment comparing reactive and proactive autoscaling under a $t = 100$ ms SLA. The experiment reveals that conventional metrics can mislead about performance: proactive scaling may reduce average latency but worsen recovery dynamics due to forecast overprovisioning. The proposed metrics provide deeper, SLA-aligned insights into latency behavior and can inform the design of SLA-aware autoscaling and deterministic-latency services in edge-cloud deployments.

Abstract

Conventional latency metrics are formulated based on a broad definition of traditional monolithic services, and hence lack the capacity to address the complexities inherent in modern services and distributed computing paradigms. Consequently, their effectiveness in identifying areas for improvement is restricted, and they fall short of providing a comprehensive evaluation of service performance in this context. More specifically, these metrics do not offer insights into two critical aspects of service performance: the frequency with which latency surpasses specified Service Level Agreement (SLA) thresholds and the time required for latency to return to an acceptable level once a threshold is exceeded. This limitation is significant for contemporary latency-sensitive services, especially immersive services that require deterministic, consistently low latency. To address this limitation, the authors propose five novel latency metrics that, when used alongside the conventional latency metrics, provide advanced insights that can potentially be used to improve service performance. The validity and usefulness of the proposed metrics are evaluated using a large-scale experiment.
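
The two aspects named in the abstract, how often latency exceeds an SLA threshold and how long it takes to recover, can be illustrated with a minimal sketch. This is not the paper's definition of M1–M5, which this page does not give. The function name `summarize_violations`, the `ViolationSummary` fields, the sample-and-hold interpretation of the latency series, and the `fraction_compliant` ratio are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ViolationSummary:
    """Hypothetical summary; field names are illustrative, not from the paper."""
    time_violations: float      # total time with latency above the threshold
    time_no_violations: float   # total time with latency at or below the threshold
    episodes: int               # number of distinct violation episodes
    mean_recovery_time: float   # mean episode duration (0.0 if there are none)

    @property
    def fraction_compliant(self) -> float:
        # Assumed availability-style ratio: share of observed time without violations.
        total = self.time_violations + self.time_no_violations
        return self.time_no_violations / total if total > 0 else 1.0


def summarize_violations(
    samples: Sequence[Tuple[float, float]], threshold_ms: float
) -> ViolationSummary:
    """Summarize SLA violations in a latency time series.

    samples: (timestamp_s, latency_ms) pairs sorted by timestamp.
    Assumption: each sample's latency holds until the next sample's timestamp,
    so the last sample contributes no duration.
    """
    time_v = 0.0
    time_ok = 0.0
    episode_durations: List[float] = []
    current = 0.0
    in_violation = False

    for (t0, latency), (t1, _) in zip(samples, samples[1:]):
        dt = t1 - t0
        if latency > threshold_ms:      # a violation is latency strictly above the threshold
            time_v += dt
            current += dt
            in_violation = True
        else:
            time_ok += dt
            if in_violation:            # latency recovered: close the episode
                episode_durations.append(current)
                current = 0.0
                in_violation = False

    if in_violation:                    # episode still open at the end of the trace
        episode_durations.append(current)

    mean_recovery = (
        sum(episode_durations) / len(episode_durations) if episode_durations else 0.0
    )
    return ViolationSummary(time_v, time_ok, len(episode_durations), mean_recovery)


# Usage with made-up samples and the 100 ms threshold mentioned in the TL;DR.
trace = [(0, 80), (1, 120), (2, 130), (3, 90), (4, 150), (5, 70), (6, 60)]
summary = summarize_violations(trace, threshold_ms=100.0)
print(summary, summary.fraction_compliant)
```

Two traces with the same average latency can differ sharply in `episodes` and `mean_recovery_time`, which is the kind of gap the abstract attributes to conventional metrics.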

Paper Structure (6 sections, 1 table)