UniFaaS: Programming across Distributed Cyberinfrastructure with Federated Function Serving

Yifei Li; Ryan Chard; Yadu Babuji; Kyle Chard; Ian Foster; Zhuozhao Li

TL;DR

The paper addresses the challenge of running fine-grained scientific workflows across federated cyberinfrastructure (CI) with dynamic resource capacity. It proposes UniFaaS, a federated FaaS framework built on funcX, featuring a Python-based programming model with dynamic task graphs, a data manager for cross-resource transfers, and an observe-predict-decide control loop. A key contribution is the dynamic heterogeneity-aware (DHA) scheduling algorithm, which combines offline and online decisions with a delay mechanism and re-scheduling, plus a data-transfer-aware priority scheme, $priority(t_i) = \bar{d_i} + \bar{w_i} + \max_{t_j \in succ(t_i)} priority(t_j)$, and capacity-based partitioning, $M_i = M \times \frac{c_i}{\sum_j c_j}$. Experiments show that UniFaaS scales to 16 endpoints with minimal scheduler overhead, improving a drug screening workflow by up to 22.99% and a montage workflow by 54.41% when additional resources are available, which supports its practical viability for cross-facility scientific computing. The work demonstrates a path toward portable, elastic, and fine-grained function-level workflow management across federated CI, enabling researchers to exploit heterogeneous resources and queue-time trade-offs efficiently.
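The recursive priority scheme can be illustrated with a minimal Python sketch. This is not the UniFaaS implementation: the toy task graph, the per-task values, and the reading of $\bar{d_i}$ as an average data-transfer time and $\bar{w_i}$ as an average execution time are assumptions made for illustration, as is the convention that a task with no successors contributes 0 for the max term.

```python
from functools import lru_cache

# Hypothetical toy DAG: each task maps to its list of successors.
successors = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

# Assumed per-task averages (seconds): data-transfer time and execution time.
d_bar = {"a": 1.0, "b": 2.0, "c": 0.5, "d": 0.0}
w_bar = {"a": 3.0, "b": 1.0, "c": 4.0, "d": 2.0}


@lru_cache(maxsize=None)
def priority(task):
    """priority(t_i) = d_bar_i + w_bar_i + max over successors of priority(t_j)."""
    # Assumption: an exit task (no successors) has a max term of 0.
    succ_best = max((priority(s) for s in successors[task]), default=0.0)
    return d_bar[task] + w_bar[task] + succ_best


# Tasks sorted from highest to lowest priority.
order = sorted(successors, key=priority, reverse=True)
print(order)
```

In this sketch a task's priority grows with the longest chain of transfer plus execution time that still depends on it, so tasks on the critical path sort first.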

Abstract

Modern scientific applications are increasingly decomposable into individual functions that may be deployed across distributed and diverse cyberinfrastructure such as supercomputers, clouds, and accelerators. Such applications call for new approaches to programming, distributed execution, and function-level management. We present UniFaaS, a parallel programming framework that relies on a federated function-as-a-service (FaaS) model to enable composition of distributed, scalable, and high-performance scientific workflows, and to support fine-grained function-level management. UniFaaS provides a unified programming interface to compose dynamic task graphs with transparent wide-area data management. UniFaaS exploits an observe-predict-decide approach to efficiently map workflow tasks to target heterogeneous and dynamic resources. We propose a dynamic heterogeneity-aware scheduling algorithm that employs a delay mechanism and a re-scheduling mechanism to accommodate dynamic resource capacity. Our experiments show that UniFaaS can efficiently execute workflows across computing resources with minimal scheduling overhead. We show that UniFaaS can improve the performance of a real-world drug screening workflow by as much as 22.99% when employing an additional 19.48% of resources and a montage workflow by 54.41% when employing an additional 47.83% of resources across multiple distributed clusters, in contrast to using a single cluster.

Paper Structure (28 sections, 2 equations, 13 figures, 5 tables)

Introduction
Motivation and Background
Programming with UniFaaS
Programming Model
Dynamic Task Graph
Configuration
Architecture and Implementation
Overview
Monitors
Profilers
Scheduler
Data Manager
Task Executor
Fault Tolerance
Optimizations
...and 13 more sections

Figures (13)

Figure 1: UniFaaS architecture.
Figure 2: An illustration of Capacity. EPs 1-3 have 5, 2, and 1 workers, respectively. According to the capacity and DFS order, tasks 1-5, tasks 6-7, and task 8 are assigned to endpoints 1, 2, and 3, respectively.
Figure 3: An illustration of Locality. When an endpoint monitor detects idle resources, it performs locality selection for the next ready task and immediately dispatches the task to the target endpoint.
Figure 4: An illustration of DHA. DHA involves prioritization and endpoint selection. Tasks are not dispatched until the target endpoint has idle resources.
Figure 5: UniFaaS latency breakdown.
...and 8 more figures
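
The Capacity example in the Figure 2 caption can be reproduced with a short Python sketch. It assumes that tasks arrive already sorted in DFS order and that each endpoint receives a contiguous slice proportional to its worker count. The function name, the rounding rule, and the choice to give any remainder to the last endpoint are hypothetical details not taken from the paper.

```python
def partition_by_capacity(tasks, capacities):
    """Split an ordered task list into contiguous slices proportional to capacity.

    tasks: list of task ids, assumed to be in DFS order already.
    capacities: dict mapping endpoint name -> worker count (insertion-ordered).
    """
    total = sum(capacities.values())
    m = len(tasks)
    endpoints = list(capacities)
    assignment = {}
    start = 0
    for idx, ep in enumerate(endpoints):
        if idx == len(endpoints) - 1:
            # Assumption: the last endpoint takes whatever is left over.
            share = m - start
        else:
            # Assumption: round the proportional share, never past the end.
            share = min(round(m * capacities[ep] / total), m - start)
        assignment[ep] = tasks[start:start + share]
        start += share
    return assignment


# Figure 2 setup: endpoints with 5, 2, and 1 workers, and tasks 1 to 8.
result = partition_by_capacity(list(range(1, 9)), {"ep1": 5, "ep2": 2, "ep3": 1})
print(result)
```

With the caption's numbers, the proportional shares are exact integers (5, 2, and 1 of the 8 tasks), so the rounding rule does not affect the outcome here.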
