Leveraging Interpretability in the Transformer to Automate the Proactive Scaling of Cloud Resources

Amadou Ba; Pavithra Harsha; Chitra Subramanian

TL;DR

This work tackles proactive autoscaling for microservices with an interpretable latency prediction pipeline. A Temporal Fusion Transformer (TFT) predicts the end-to-end latency $y_{t,m}$ and exposes attention-based feature importance. When an SLA violation is predicted, a Kernel Ridge Regression (KRR) step uses the TFT-derived feature importances to estimate autoscaling parameters, enabling targeted horizontal (pod count) or vertical (CPU/memory) adjustments. The approach is validated on the Robot Shop microservices application deployed on IBM Cloud. It shows competitive latency predictions (evaluated with metrics such as $p95$, RMSE, and $R^2$) and translates interpretability into autoscaling actions via L-BFGS-B optimization of the parameters $\theta$. The result is a practical, interpretable path to SLA compliance and cost-efficient resource provisioning in cloud-native environments, with a deployment roadmap for integration into multi-cloud management platforms.

Abstract

Modern web services adopt cloud-native principles to leverage the advantages of microservices. To consistently guarantee high Quality of Service (QoS) according to Service Level Agreements (SLAs), ensure satisfactory user experiences, and minimize operational costs, each microservice must be provisioned with the right amount of resources. However, accurate provisioning is difficult because it depends on many factors, including workload intensity and the intricate interconnections between microservices. To address this challenge, we develop a model that captures the relationship between end-to-end latency, requests at the front-end level, and resource utilization, and we use this model to predict the end-to-end latency. Our solution leverages the Temporal Fusion Transformer (TFT), an attention-based architecture equipped with interpretability features. When the predictions indicate SLA non-compliance, we use the feature importance provided by the TFT as covariates in Kernel Ridge Regression (KRR), with the desired latency as the response variable, to learn the parameters associated with the feature importance. These learned parameters reflect the adjustments to the features that are required to ensure SLA compliance. We demonstrate the merit of our approach on a microservice-based application and provide a roadmap to deployment.
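The KRR step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the RBF kernel, the regularization value `lam`, the kernel width `gamma`, and the synthetic data are all assumptions chosen only to show how feature importances can serve as covariates with the desired latency as the response.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_krr(X, y, lam=1e-3, gamma=1.0):
    # Closed-form kernel ridge solution: alpha = (K + lam * I)^{-1} y.
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_krr(X_train, alpha, X_new, gamma=1.0):
    # Prediction is a kernel-weighted sum over the training points.
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Hypothetical data (assumption): 40 time windows, 3 feature importances per
# window, standing in for TFT outputs. Each row sums to 1.
rng = np.random.default_rng(0)
importances = rng.dirichlet(np.ones(3), size=40)

# Synthetic "desired latency" response in seconds (assumption, for illustration).
desired_latency = 0.2 + 0.5 * importances[:, 0] + 0.1 * importances[:, 1]

alpha = fit_krr(importances, desired_latency)
fitted = predict_krr(importances, alpha, importances)
rmse = float(np.sqrt(np.mean((fitted - desired_latency) ** 2)))
```

In practice a library implementation such as scikit-learn's `KernelRidge` would likely replace the hand-written solver; the closed form is spelled out here only to make the role of the covariates and the response explicit.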

Paper Structure (14 sections, 5 figures, 2 tables)

Introduction
Related work
Approach to proactive autoscaling
Data representation for latency prediction
Temporal Fusion Transformer for the prediction
Kernel Ridge Regression for parametric estimation
The autoscaling mechanism
Experiments
Selected end-to-end latency
End-to-end latency prediction
Feature importance associated with the predictions
Autoscaling cloud resources
Roadmap to deployment
Conclusions

Figures (5)

Figure 1: Building blocks of the proposed approach. (1) Deploying the microservices and acquiring the data. (2) Building the predictive models using Temporal Fusion Transformer. (3) Using the statistical feature importance values as new features for the KRR and building the new predictive models by fitting each KRR model to each feature importance. (4) Defining and minimizing the objective function based on the actual latency and the predicted latency. (5) Using the estimated parameters associated with each feature importance to perform autoscaling.
Figure 2: Example of trace executions and their durations.
Figure 3: Call graph using Robot Shop.
Figure 4: Example of end-to-end latency predictions at the traces level.
Figure 5: Feature importance associated with the end-to-end latency prediction.
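Steps (4) and (5) of Figure 1 can be sketched as a small bounded optimization. The sketch below assumes a toy surrogate latency model, hypothetical importance and sensitivity values, and a quadratic cost on scaling; none of these come from the paper. It only shows how L-BFGS-B can estimate parameters $\theta$ that pull a predicted latency toward an SLA target.

```python
import numpy as np
from scipy.optimize import minimize

# All values below are hypothetical and chosen for illustration only.
importance = np.array([0.5, 0.3, 0.2])  # feature importances: CPU, memory, pod count
weights = np.array([0.8, 0.4, 0.6])     # assumed latency sensitivity per feature
predicted_latency = 1.2                 # seconds, forecast that violates the SLA
sla_latency = 0.8                       # seconds, desired latency

def latency_model(theta):
    # Toy surrogate (assumption): latency shrinks as resource multipliers grow.
    # theta = 1 means "no change" and reproduces the predicted latency.
    return predicted_latency / (1.0 + weights @ (importance * (theta - 1.0)))

def objective(theta):
    # Squared gap to the SLA target plus a small penalty on over-provisioning.
    gap = latency_model(theta) - sla_latency
    cost = 1e-3 * np.sum((theta - 1.0) ** 2)
    return gap**2 + cost

# Bounds keep each multiplier between "no change" (1x) and an assumed 4x cap.
result = minimize(objective, x0=np.ones(3), method="L-BFGS-B",
                  bounds=[(1.0, 4.0)] * 3)
theta_hat = result.x
scaled_latency = float(latency_model(theta_hat))
```

Under these assumptions, each entry of `theta_hat` would be read as a multiplier on the corresponding resource, for example a larger CPU limit (vertical scaling) or more pods (horizontal scaling). How the estimated parameters map to concrete scaling actions in the paper is not specified in this summary.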
