Covers fault-tolerance, distributed algorithms, stability, parallel computation, and cluster computing.
LoRA enables efficient customization of LLMs and is widely used in multi-tenant and multi-task serving. However, emerging model architectures such as Mixture-of-Experts (MoE) significantly increase LoRA memory cost, making existing coupled LoRA serving designs poorly scalable and prone to tail-latency inflation. We present InfiniLoRA, a disaggregated LoRA serving system that decouples LoRA execution from base-model inference. InfiniLoRA introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations, including GPU-initiated communication and hardware-specialized LoRA kernels. Experiments show that InfiniLoRA can achieve an average $3.05\times$ increase in serviceable request rate under strict latency SLOs, and improve the percentage of LoRA adapters satisfying the SLO requirement by 54.0\%.
Blockchain ecosystems face a significant liquidity-fragmentation issue, as applications and assets are distributed across many public chains, each accessible to only a subset of users. Cross-chain communication was designed to address this by allowing chains to interoperate, but existing solutions limit communication to directly connected chains or route traffic through hubs that create bottlenecks and centralization risks. In this paper, we introduce xRoute, a cross-chain routing and message-delivery framework inspired by traditional networks. Our design brings routing, name resolution, and policy-based delivery to the blockchain setting. It allows applications to specify routing policies, enables destination chains to verify that selected routes satisfy security requirements, and uses a decentralized relayer network to compute routes and deliver messages without introducing a trusted hub. Experiments on the chains supporting the Inter-Blockchain Communication (IBC) protocol show that our approach improves connectivity, decentralization, and scalability compared to hub-based designs, particularly under heavy load.
GPUs are becoming a major contributor to data center power, yet unlike CPUs, they can remain at high power even when visible activity is near zero. We call this state execution-idle. Using per-second telemetry from a large academic AI cluster, we characterize execution-idle as a recurring low-activity yet high-power state in real deployments. Across diverse workloads and multiple GPU generations, it accounts for 19.7% of in-execution time and 10.7% of energy. This suggests a need to both reduce the cost of execution-idle and reduce exposure to it. We therefore build two prototypes: one uses automatic downscaling during execution-idle, and the other uses load imbalance to reduce exposure, both with performance trade-offs. These findings suggest that future energy-efficient GPU systems should treat execution-idle as a first-class operating state.
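To make the measurement concrete, here is a minimal sketch of how execution-idle seconds might be flagged and aggregated from per-second telemetry; the utilization and power thresholds, field names, and sample format are illustrative assumptions rather than the paper's definitions.

```python
# Sketch: flag "execution-idle" seconds from per-second GPU telemetry.
# Thresholds and record fields (util_pct, power_w) are illustrative
# assumptions, not the definitions used in the paper.

def execution_idle_stats(samples, util_threshold=5.0, power_threshold=100.0):
    """samples: list of dicts with 'util_pct' and 'power_w', taken while a
    job is in execution.  Returns (fraction of in-execution time, fraction
    of energy) spent in the low-activity yet high-power state."""
    idle_seconds, idle_energy, total_energy = 0, 0.0, 0.0
    for s in samples:
        total_energy += s["power_w"]            # 1-second samples: W ~ J
        if s["util_pct"] < util_threshold and s["power_w"] > power_threshold:
            idle_seconds += 1
            idle_energy += s["power_w"]
    n = len(samples)
    return (idle_seconds / n if n else 0.0,
            idle_energy / total_energy if total_energy else 0.0)
```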
Low Earth orbit (LEO) satellites play an essential role in intelligent Earth observation by leveraging artificial intelligence models. However, limited onboard memory and excessive inference delay prevent the practical deployment of large language models (LLMs) on a single satellite. In this paper, we propose a communication-efficient collaborative LLM inference scheme for LEO satellite networks. Specifically, the entire LLM is split into multiple sub-models, each deployed on a satellite, thereby enabling collaborative LLM inference via exchanging intermediate activations between satellites. The proposed scheme also leverages a pipeline parallelism mechanism that overlaps sub-model inference with intermediate activation transmission, thereby reducing LLM inference delay. An adaptive activation compression scheme is designed to mitigate cumulative errors from multi-stage model splitting while preserving inference accuracy. Furthermore, we formulate the LLM inference delay minimization problem by jointly optimizing model splitting and compression ratios under onboard memory and inference accuracy constraints. The problem is transformed into a shortest-path search problem over a directed acyclic graph whose edge weights explicitly quantify the inference delay induced by model splitting and compression strategies, which is solved via a modified A*-based search algorithm. Extensive simulation results indicate that the proposed solution can reduce inference delay by up to 42% and communication overhead by up to 71% compared to state-of-the-art benchmarks, while keeping the inference accuracy loss below 1%.
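The shortest-path view of the splitting problem can be sketched as follows: node $i$ represents a cut after layer $i$, an edge $(i, j)$ places layers $i+1$ to $j$ on the next satellite, and its weight is the sub-model compute delay plus the delay of transmitting the compressed boundary activation. The paper solves this with a modified A*-based search; the sketch below uses plain Dijkstra as a stand-in, ignores the satellite count and orbital topology, and all cost and memory models are illustrative assumptions.

```python
# Sketch: model splitting as a shortest-path search over a DAG.
# Node i = "cut after layer i"; edge (i, j) places layers i+1..j on the next
# satellite, with weight = sub-model compute delay + delay of sending the
# compressed boundary activation.  Dijkstra stands in for the paper's
# modified A*-based search; all cost/memory models here are illustrative.
import heapq

def split_layers(num_layers, layer_time, layer_mem, act_size, link_rate,
                 compress_ratio, mem_budget):
    def edge_cost(i, j):                  # layers i+1..j on one satellite
        if sum(layer_mem[i:j]) > mem_budget:
            return None                   # violates onboard memory constraint
        compute = sum(layer_time[i:j])
        comm = 0.0 if j == num_layers else act_size[j] * compress_ratio / link_rate
        return compute + comm

    dist, prev = {0: 0.0}, {}
    pq = [(0.0, 0)]
    while pq:
        d, i = heapq.heappop(pq)
        if i == num_layers:
            break
        for j in range(i + 1, num_layers + 1):
            c = edge_cost(i, j)
            if c is not None and d + c < dist.get(j, float("inf")):
                dist[j] = d + c
                prev[j] = i
                heapq.heappush(pq, (d + c, j))
    if num_layers not in dist:
        return None, []                   # no feasible split under the budget
    cuts, j = [], num_layers              # recover the sub-model boundaries
    while j != 0:
        cuts.append((prev[j], j))
        j = prev[j]
    return dist[num_layers], list(reversed(cuts))

# Example: 8 layers, uniform costs, each satellite holds 3 layers' worth.
delay, cuts = split_layers(8, [1.0] * 8, [1.0] * 8, [0.5] * 8, link_rate=2.0,
                           compress_ratio=0.5, mem_budget=3.0)
```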
As smart grids increasingly depend on IoT devices and distributed energy management, they require decentralized, low-latency orchestration of energy services. We address this with a unified framework for edge-fog-cloud infrastructures tailored to smart energy systems. It features a graph-based data model that captures infrastructure and workload, enabling efficient topology exploration and task placement. Leveraging this model, a swarm-based heuristic algorithm handles task offloading in a resource-aware, latency-sensitive manner. Our framework ensures data interoperability via energy data space compliance and guarantees traceability using blockchain-based workload notarization. We validate our approach with a real-world KubeEdge deployment, demonstrating zero-downtime service migration under dynamic workloads while maintaining service continuity.
We study deterministic exploration by a single agent in $T$-interval-connected graphs, a standard model of dynamic networks in which, for every time window of length $T$, the intersection of the graphs within the window is connected. The agent does not know the window size $T$, nor the number of nodes $n$ or edges $m$, and must visit all nodes of the graph. We consider two visibility models, $KT_0$ and $KT_1$, depending on whether the agent can observe the identifiers of neighboring nodes. We investigate two fundamental questions: the minimum window size that guarantees exploration, and the optimal exploration time under a sufficiently large window size. For both models, we show that a window size $T = \Omega(m)$ is necessary. We also present deterministic algorithms whose required window size is $O(\varepsilon(n,m)\cdot m + n \log^2 n)$, where $\varepsilon(n,m) = \frac{\ln n}{1 + \ln m - \ln n}$. These bounds are tight for a wide range of $m$, in particular when $m = n^{1+\Theta(1)}$. The same algorithms also yield optimal or near-optimal exploration time: we prove lower bounds of $\Omega((m - n + 1)n)$ in the $KT_0$ model and $\Omega(m)$ in the $KT_1$ model, and show that our algorithms match these bounds up to a polylogarithmic factor, while being fully time-optimal when $m = n^{1+\Theta(1)}$. This yields tight bounds when parameterized solely by $n$: $\Theta(n^3)$ for $KT_0$ and $\Theta(n^2)$ for $KT_1$.
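A short calculation, included here as a reading aid, shows why the window-size bounds coincide in the regime $m = n^{1+\Theta(1)}$: for $m = n^{1+c}$ with constant $c > 0$ we have $\ln m - \ln n = c \ln n$, hence
$$\varepsilon(n,m) = \frac{\ln n}{1 + c\ln n} = \Theta(1),$$
so the algorithmic window size $O(\varepsilon(n,m)\cdot m + n\log^2 n)$ collapses to $O(m + n\log^2 n) = O(m)$, matching the $\Omega(m)$ lower bound, since $m = n^{1+c}$ dominates $n\log^2 n$.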
In Scientific Computing and modern Machine Learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM calls, they remain constrained by the BLAS API, which requires each call to independently pack input matrices and restore outputs to a canonical memory layout. In sequential GEMMs, these constraints cause redundant packing and unpacking, wasting valuable computational resources. This paper introduces LP-GEMM, a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries. We evaluate LP-GEMM on x86 (AVX-512) and RISC-V (RVV 1.0) architectures across MLP-like and Attention-like workloads. Our results show average speedups of 2.25x over OpenBLAS on Intel x86 for sequential GEMMs and competitive gains relative to vendor-optimized libraries such as Intel MKL. We demonstrate the practicality of the approach beyond microbenchmarks by implementing a standalone C++ version of the Llama-3.2 inference path using exclusively BLAS-level GEMM calls. These results confirm that leveraging data layout propagation between operations can significantly boost performance.
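The core idea of packing-layout propagation can be illustrated with a toy blocked GEMM: the intermediate of a GEMM chain stays in the packed (blocked) layout and is consumed directly by the next call, with unpacking only at the chain boundary. The block size and NumPy layout below are illustrative; LP-GEMM's actual micro-kernel decomposition is considerably more involved.

```python
# Sketch: propagating a packed (blocked) layout across two chained GEMMs,
# so the intermediate result is never unpacked to row-major and repacked.
import numpy as np

B = 4  # block size (assumes matrix dimensions are multiples of B)

def pack(X):
    """Row-major (M, N) -> blocked layout (M//B, N//B, B, B)."""
    M, N = X.shape
    return X.reshape(M // B, B, N // B, B).transpose(0, 2, 1, 3).copy()

def unpack(Xp):
    """Blocked layout back to row-major (only needed at the boundaries)."""
    mb, nb, _, _ = Xp.shape
    return Xp.transpose(0, 2, 1, 3).reshape(mb * B, nb * B)

def gemm_packed(Ap, Bp):
    """GEMM that consumes and produces the blocked layout directly."""
    mb, kb, nb = Ap.shape[0], Ap.shape[1], Bp.shape[1]
    Cp = np.zeros((mb, nb, B, B))
    for i in range(mb):
        for j in range(nb):
            for k in range(kb):
                Cp[i, j] += Ap[i, k] @ Bp[k, j]
    return Cp

# Chain two GEMMs, D = (A @ W1) @ W2, packing each operand once and keeping
# the intermediate in packed form between the two calls.
A, W1, W2 = (np.random.rand(8, 8) for _ in range(3))
Cp = gemm_packed(pack(A), pack(W1))        # intermediate stays packed
D = unpack(gemm_packed(Cp, pack(W2)))      # unpack only at the boundary
assert np.allclose(D, A @ W1 @ W2)
```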
Satellite emulation software is essential for research due to the lack of access to physical testbeds. To be useful, emulators must generate observations that are well-aligned with real-world ones, and they must have acceptable resource overheads for setting up and running experiments. This study provides an in-depth evaluation of three open-source emulators: StarryNet, OpenSN, and Celestial. Running them side-by-side and comparing them with real-world measurements from the WetLinks study identifies shortcomings of current satellite emulation techniques as well as promising avenues for research and development.
Diffusion models have emerged as the prevailing approach for text-to-image (T2I) and text-to-video (T2V) generation, yet production platforms must increasingly serve both modalities on shared GPU clusters while meeting stringent latency SLOs. Co-serving such heterogeneous workloads is challenging: T2I and T2V requests exhibit vastly different compute demands, parallelism characteristics, and latency requirements, leading to significant SLO violations in existing serving systems. We present GENSERVE, a co-serving system that leverages the inherent predictability of the diffusion process to optimize serving efficiency. A central insight is that diffusion inference proceeds in discrete, predictable steps and is naturally preemptible at step boundaries, opening a new design space for heterogeneity-aware resource management. GENSERVE introduces step-level resource adaptation through three coordinated mechanisms: intelligent video preemption, elastic sequence parallelism with dynamic batching, and an SLO-aware scheduler that jointly optimizes resource allocation across all concurrent requests. Experimental results show that GENSERVE improves the SLO attainment rate by up to 44% over the strongest baseline across diverse configurations.
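The step-boundary preemption idea can be sketched with a toy scheduler that re-evaluates which request to run after every diffusion step, so a short T2I request can displace a long T2V request whenever its latency SLO is at risk. The request fields, deadlines, and least-slack policy below are illustrative assumptions, not GENSERVE's actual scheduler.

```python
# Sketch: step-boundary preemption for co-served T2I/T2V diffusion requests.
# After every diffusion step the scheduler re-picks the least-slack request,
# so a short image request can preempt a long video request at a step
# boundary.  Fields, deadlines, and the policy are illustrative assumptions.
class Request:
    def __init__(self, kind, steps, step_time, deadline):
        self.kind, self.steps_left = kind, steps
        self.step_time, self.deadline = step_time, deadline

def run(requests):
    now, ready = 0.0, list(requests)
    while ready:
        # Least slack first: deadline minus (elapsed + remaining work).
        r = min(ready, key=lambda q: q.deadline - now - q.steps_left * q.step_time)
        now += r.step_time          # run exactly one diffusion step
        r.steps_left -= 1           # ...then re-evaluate (preemption point)
        if r.steps_left == 0:
            print(f"{r.kind} finished at t={now:.2f}s (deadline {r.deadline}s)")
            ready.remove(r)

run([Request("t2v", steps=50, step_time=0.4, deadline=60.0),
     Request("t2i", steps=30, step_time=0.05, deadline=5.0)])
```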
We present a comprehensive analysis of Round-Delayed Amnesiac Flooding (RDAF), a variant of Amnesiac Flooding that introduces round-based asynchrony through adversarial delays. We establish fundamental properties of RDAF, including termination characteristics for different graph types and decidability results under various adversarial models. Our key contributions include: (1) a formal model of RDAF incorporating round-based asynchrony, (2) a proof that flooding always terminates on acyclic graphs despite adversarial delays, (3) a construction showing non-termination is possible on any cyclic graph, (4) a demonstration that termination is undecidable with arbitrary computable adversaries, and (5) the introduction of Eventually Periodic Adversaries (EPA) under which termination becomes decidable. These results enhance our understanding of flooding in communication-delay settings and provide insights for designing robust distributed protocols.
Autonomous software agents on blockchains solve distributed-coordination problems by reading shared ledger state instead of exchanging direct messages. Liquidation keepers, arbitrage bots, and other autonomous on-chain agents watch balances, contract storage, and event logs; when conditions change, they act. The ledger therefore functions as a replicated shared-state medium through which decentralized agents coordinate indirectly. This form of indirect coordination mirrors what Grassé called stigmergy in 1959: organisms coordinating through traces left in a shared environment, with no central plan. Stigmergy has mature formalizations in swarm intelligence and multi-agent systems, and on-chain agents already behave stigmergically in practice, but no prior application-layer framework cleanly bridges the two. We introduce Indirect coordination grounded in ledger state (Coordinación indirecta basada en el estado del registro contable) as a ledger-specific applied definition that maps Grassé's mechanism onto distributed ledger technology. We operationalize this with a state-transition formalism, identify three recurring base on-chain coordination patterns (State-Flag, Event-Signal, Threshold-Trigger) together with a Commit-Reveal sequencing overlay, and work through a State-Flag task-board example to compare ledger-state coordination analytically with off-chain messaging and centralized orchestration. The contribution is a reusable vocabulary, a ledger-specific formal mapping, and design guidance for decentralized coordination over replicated shared state at the application layer.
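A minimal sketch of the State-Flag pattern, using a plain dictionary to stand in for replicated contract storage; in a real deployment each read, claim, and completion would be a ledger transaction, and all names below are illustrative.

```python
# Sketch: the State-Flag pattern over a ledger-like shared key/value store.
# A plain dict stands in for replicated contract storage; structure and
# names are illustrative, not the paper's formalism.

board = {  # shared task board: task id -> state flag written to the ledger
    "task-1": {"state": "OPEN", "claimed_by": None},
    "task-2": {"state": "OPEN", "claimed_by": None},
}

def claim(agent, task_id):
    """Agents coordinate indirectly: an agent reads the flag left by others
    and only acts if the task is still open (no direct agent-to-agent message)."""
    entry = board[task_id]
    if entry["state"] != "OPEN":
        return False                     # another agent's trace says: skip
    entry["state"], entry["claimed_by"] = "CLAIMED", agent
    return True

def complete(agent, task_id):
    entry = board[task_id]
    if entry["state"] == "CLAIMED" and entry["claimed_by"] == agent:
        entry["state"] = "DONE"

# Two autonomous agents act on the same board without messaging each other.
claim("keeper-A", "task-1")
claim("keeper-B", "task-1")              # sees the flag, backs off
claim("keeper-B", "task-2")
complete("keeper-A", "task-1")
```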
DAG-Rider popularized a new paradigm of DAG-BFT protocols, separating dissemination from consensus: all nodes disseminate transactions as blocks that reference previously known blocks, while consensus is reached by electing certain blocks as leaders. This design yields high throughput but confers optimal latency only to leader blocks; non-leader blocks cannot be committed independently. We present Lemonshark, an asynchronous DAG-BFT protocol that reinterprets the DAG at a transactional level and identifies conditions where commitment is sufficient -- but not necessary -- for safe results, enabling nodes to finalize transactions before official commitment, without compromising correctness. Compared to the state-of-the-art asynchronous BFT protocol, Lemonshark reduces latency by up to 65\%.
Operating Elasticsearch clusters at scale demands continuous human expertise spanning the full lifecycle -- from initial deployment through performance tuning, monitoring, failure prediction, and incident recovery. We present the ES Guardian Agent, an autonomous AI SRE system that manages the complete Elasticsearch lifecycle without human intervention through eleven distinct phases: Evaluate, Optimize, Deploy, Calibrate, Stabilize, Alert, Predict, Heal, Learn, and Upgrade. A critical differentiator is its multi-source predictive failure engine, which continuously ingests and correlates metrics trends, application logs, and kernel-level telemetry -- including Linux dmesg streams, NVMe SMART data, NIC bond statistics, and thermal sensors -- to anticipate failures hours before they materialize. By cross-referencing current system signatures against a persistent incident memory of resolved failures, the AI engine stages corrective actions proactively. Through four successive agent architectures -- culminating in a 4,589-line system with five monitoring layers and an iterative AI action loop -- we demonstrate that an LLM equipped with tool-use access can function as a full-lifecycle autonomous SRE targeting six-nines (99.9999%) availability. In production evaluation, the Guardian Agent executed 300 autonomous investigation-and-repair cycles, recovered a cluster from an 18-hour cross-system outage, diagnosed hardware NIC failures across all host nodes, and maintained continuous operational visibility. We establish that data volume per shard -- not tuning -- is the primary determinant of query performance, with latency scaling at 0.26 ms per MB/shard.
We study the plurality consensus problem in distributed systems where a population of extremely simple agents, each initially holding one of $k$ opinions, aims to agree on the initially most frequent one. In this setting, $h$-majority is arguably the simplest and most studied protocol, in which each agent samples the opinion of $h$ neighbors uniformly at random and updates its opinion to the most frequent value in the sample. We propose a new, extremely simple mechanism called DéjàVu: an agent queries neighbors until it encounters an opinion for the second time, at which point it updates its own opinion to the duplicate value. This rule does not require agents to maintain counters or estimate frequencies, nor to choose any parameter (such as a sample size $h$); it relies solely on the primitive ability to detect repetition. We provide a rigorous analysis of DéjàVu that relies on several technical ideas of independent interest and demonstrates that it is competitive with $h$-majority and, in some regimes, substantially more communication-efficient, thus yielding a powerful primitive for plurality consensus.
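A single DéjàVu update, as described above, is easy to simulate: an agent keeps sampling opinions until one repeats and adopts that duplicate. The complete-interaction setting (sampling from the whole population) and the toy parameters below are illustrative choices, not the analyzed model.

```python
# Sketch: one DéjàVu update -- the agent keeps sampling opinions until some
# opinion is seen for the second time, then adopts that duplicated opinion.
import random
from collections import Counter

def dejavu_step(opinions, agent):
    seen = set()
    while True:
        o = opinions[random.randrange(len(opinions))]   # query a random agent
        if o in seen:                                    # second occurrence
            opinions[agent] = o
            return
        seen.add(o)

# Toy run: 1000 agents, 3 opinions, opinion 0 holds the initial plurality.
opinions = [0] * 400 + [1] * 300 + [2] * 300
for _ in range(50_000):
    dejavu_step(opinions, random.randrange(len(opinions)))
print(Counter(opinions))   # opinion 0 typically dominates after many updates
```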
As large-scale HPC compute clusters adopt accelerators such as GPUs to meet the voracious demands of modern workloads, these clusters are increasingly becoming power constrained. Unfortunately, modern applications can often temporarily exceed the power ratings of the accelerators ("power spikes"). Thus, current and future HPC systems must optimize for power and performance together. However, this is made difficult by increasingly diverse applications, which often require bespoke optimizations to run efficiently on each cluster. Traditionally, researchers overcome this problem by profiling applications on specific clusters and then optimizing, but the scale, algorithmic diversity, and lack of effective tools make this challenging. To overcome these inefficiencies, we propose Minos, a systematic classification mechanism that identifies similar application characteristics via low-cost profiling for power and performance. This allows us to group similarly behaving workloads into a finite number of distinct classes and reduce the overhead of extensively profiling new workloads. For example, when predicting frequency-capping behavior for a previously unseen application, Minos reduces profiling time by 89%. Moreover, across 18 popular graph analytics, HPC, HPC+ML, and ML workloads, Minos achieves a mean error of 4% for power predictions and 3% for performance predictions, improving predictions over state-of-the-art approaches by 10%.
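A rough sketch of the classification idea: cheap profile vectors are clustered into a small number of classes, and a new workload is profiled briefly and assigned to the nearest class, inheriting that class's known power and performance behavior. The feature set, class count, and use of k-means are illustrative assumptions; Minos's actual mechanism may differ.

```python
# Sketch: grouping workloads by short, low-cost profiles and reusing the
# class's known power/performance behavior for a new application.
import numpy as np
from sklearn.cluster import KMeans

# Each row: a cheap profile, e.g. [avg power (W), SM utilization, mem BW util].
profiles = np.array([
    [250, 0.90, 0.30],   # compute-bound
    [260, 0.95, 0.25],
    [180, 0.40, 0.80],   # memory-bound
    [175, 0.35, 0.85],
    [120, 0.20, 0.20],   # lightly loaded
])
classes = KMeans(n_clusters=3, n_init=10, random_state=0).fit(profiles)

# A previously unseen workload gets only a short profiling run, then inherits
# the behavior (e.g. frequency-capping response) of its assigned class.
new_profile = np.array([[185, 0.38, 0.82]])
print("assigned class:", classes.predict(new_profile)[0])
```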
Multi-agent LLM applications organize execution in synchronized rounds where a central scheduler gathers outputs from all agents and redistributes the combined context. This All-Gather communication pattern creates massive KV Cache redundancy, because every agent's prompt contains the same shared output blocks, yet existing reuse methods fail to exploit it efficiently. We present TokenDance, a system that scales the number of concurrent agents by exploiting the All-Gather pattern for collective KV Cache sharing. TokenDance's KV Collector performs KV Cache reuse over the full round in one collective step, so the cost of reusing a shared block is paid once regardless of agent count. Its Diff-Aware Storage encodes sibling caches as block-sparse diffs against a single master copy, achieving 11-17x compression on representative workloads. Evaluation on GenerativeAgents and AgentSociety shows that TokenDance supports up to 2.7x more concurrent agents than vLLM with prefix caching under SLO requirements, reduces per-agent KV Cache storage by up to 17.5x, and achieves up to 1.9x prefill speedup over per-request position-independent caching.
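The diff-against-master idea can be sketched in a few lines: each agent's cache is split into fixed-size blocks, and only the blocks that differ from a shared master copy are stored per agent. Block granularity and the equality test below are illustrative assumptions.

```python
# Sketch: storing each agent's KV Cache as a block-sparse diff against one
# shared master copy, so identical shared-context blocks are stored once.
import numpy as np

BLOCK = 16  # tokens per KV block

def to_blocks(kv):                       # kv: (tokens, hidden) array
    return [kv[i:i + BLOCK] for i in range(0, len(kv), BLOCK)]

def encode_diff(master_kv, agent_kv):
    """Keep only the blocks where the agent's cache differs from the master."""
    diff = {}
    for idx, (mb, ab) in enumerate(zip(to_blocks(master_kv), to_blocks(agent_kv))):
        if not np.array_equal(mb, ab):
            diff[idx] = ab
    return diff

def decode(master_kv, diff):
    blocks = to_blocks(master_kv.copy())
    for idx, ab in diff.items():
        blocks[idx] = ab
    return np.concatenate(blocks)

# The shared round context dominates each prompt, so most blocks match the
# master and the per-agent diff stays small.
master = np.random.rand(128, 8)
agent = master.copy()
agent[112:] = np.random.rand(16, 8)      # only this agent's own turn differs
diff = encode_diff(master, agent)
assert np.allclose(decode(master, diff), agent)
print(f"stored {len(diff)} of {len(to_blocks(master))} blocks for this agent")
```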
Memory-disaggregated key-value (KV) stores suffer from a severe performance bottleneck due to their I/O redundancy issues. A huge amount of redundant I/Os are generated when synchronizing concurrent data accesses, making the limited network between the compute and memory pools of disaggregated memory (DM) a performance bottleneck. We identify that the root cause of the redundant I/O lies in the mismatch between the optimistic synchronization of existing memory-disaggregated KV stores and the highly concurrent workloads on DM. In this paper, we propose to boost memory-disaggregated KV stores with pessimistic synchronization. We propose CIDER, a compute-side I/O optimization framework, to verify our idea. CIDER adopts a global write-combining technique to further reduce cross-node redundant I/Os. A contention-aware synchronization scheme is designed to improve the performance of pessimistic synchronization under low-contention scenarios. Experimental results show that CIDER effectively improves the throughput of state-of-the-art memory-disaggregated KV stores by up to $6.6\times$ under the YCSB benchmark.
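The write-combining component, taken on its own, can be sketched as a compute-side buffer that coalesces repeated updates to the same remote key into a single cross-node write; the flush policy and the placeholder `remote_write` below are illustrative assumptions standing in for the RDMA writes a real compute node would issue.

```python
# Sketch: compute-side write combining -- multiple updates to the same remote
# key within a window are coalesced into a single cross-node write.

class WriteCombiningBuffer:
    def __init__(self, flush_threshold=8):
        self.pending = {}                # key -> latest value (coalesced)
        self.flush_threshold = flush_threshold
        self.io_count = 0

    def put(self, key, value):
        self.pending[key] = value        # later writes overwrite earlier ones
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        for key, value in self.pending.items():
            self.remote_write(key, value)   # one cross-node I/O per key
        self.pending.clear()

    def remote_write(self, key, value):
        self.io_count += 1               # placeholder for an RDMA write

buf = WriteCombiningBuffer()
for i in range(100):
    buf.put(f"user:{i % 4}", i)          # 100 updates hit only 4 hot keys
buf.flush()
print("cross-node writes issued:", buf.io_count)   # far fewer than 100
```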
Multimodal large language models (MLLMs) enable powerful cross-modal reasoning capabilities but impose substantial computational and latency burdens, posing critical challenges for deployment on resource-constrained edge devices. In this paper, we propose MSAO, an adaptive modality sparsity-aware offloading framework with edge-cloud collaboration for efficient MLLM inference. First, a lightweight modality-aware fine-grained sparsity module performs spatial-temporal-modal joint analysis to compute the Modality Activation Sparsity (MAS) metric, which quantifies the necessity of each modality with minimal computational overhead. Second, an adaptive speculative edge-cloud collaborative offloading mechanism dynamically schedules workloads between edge and cloud based on the derived MAS scores and real-time system states, leveraging confidence-guided speculative execution to hide communication latency. Extensive experiments on the VQAv2 and MMBench benchmarks demonstrate that MSAO achieves a 30% reduction in end-to-end latency and a 30%-65% decrease in resource overhead, while delivering a throughput improvement of 1.5x to 2.3x compared to traditional approaches, without compromising accuracy.
Advancements in extended reality (XR) are driving the development of the metaverse, which demands efficient real-time transformation of 2D scenes into 3D objects, a computation-intensive process that necessitates task offloading because of complex perception, visual, and audio processing. This challenge is further compounded by asymmetric uplink (UL) and downlink (DL) data characteristics, where 2D data are transmitted in the UL and 3D content is rendered in the DL. To address this issue, we propose a digital twin (DT)-based in-network computing (INC)-assisted multi-access edge computing (MEC) framework that enables real-time synchronization and collaborative computing via ultra-reliable low-latency communication (URLLC). In this framework, a network operator manages wireless and computational resources for XR user devices (XUDs), while XUDs autonomously offload tasks to maximize their utilities. We model the interactions between XUDs and the operator as a Stackelberg Markov game, where the optimal offloading strategy constitutes an exact potential game with a Nash Equilibrium (NE), and the operator's problem is formulated as an asynchronous Markov decision process (MDP). We further propose a decentralized solution in which XUDs determine offloading decisions based on the operator's joint UL-DL optimization of the offloading mode (INC-E or MEC only) and DL power allocation. A Nash-asynchronous hybrid multi-agent reinforcement learning (AMRL) algorithm is developed to predict UL user association and DL transmission power, thereby achieving the NE. Simulation results demonstrate that the proposed approach considerably improves system utility, uplink rate, and energy efficiency by reducing latency and optimizing resource utilization in metaverse environments.
Nonlinear time-history evolution problems employing high-fidelity physical models are essential in numerous scientific domains. However, these problems face a critical dual bottleneck: the immense computational cost of time-stepping and the massive memory requirements for maintaining a vast array of state variables. To address these challenges, we propose a novel framework based on heterogeneous memory management for massive ensemble simulations of general nonlinear time-history problems with complex constitutive laws. Taking advantage of recent advancements in CPU-GPU interconnect bandwidth, our approach actively leverages the large capacity of host CPU memory while simultaneously maximizing the throughput of the GPU. This strategy effectively overcomes the GPU memory wall, enabling memory-intensive simulations. We evaluate the performance of the proposed method through comparisons with conventional implementations, demonstrating significant improvements in time-to-solution and energy-to-solution. Furthermore, we demonstrate the practical utility of this framework by developing a Neural Network-based surrogate model using the generated massive datasets. The results highlight the effectiveness of our approach in enabling high-fidelity 3D evaluations and its potential for broader applications in data-driven scientific discovery.
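The overlap between host-memory capacity and GPU throughput can be sketched with a generic double-buffering pattern: state chunks stream from pinned host memory to the GPU while the previous chunk is being updated. The PyTorch code below assumes a CUDA device and illustrative chunk sizes; it is a generic CPU-GPU overlap pattern, not the paper's actual memory manager.

```python
# Sketch: double-buffered streaming of ensemble state variables from pinned
# host memory to the GPU, overlapping transfers with computation.
import torch

n_chunks, chunk_elems = 64, 1 << 20
# Full ensemble state lives in (pinned) host memory; only two chunks' worth
# of GPU memory is needed for the buffers.
host_state = [torch.rand(chunk_elems, pin_memory=True) for _ in range(n_chunks)]
dev_buf = [torch.empty(chunk_elems, device="cuda") for _ in range(2)]
copy_stream = torch.cuda.Stream()

def step(chunk):                         # stand-in for the constitutive update
    return chunk.mul_(0.999).add_(1e-3)

# Prefetch chunk 0, then overlap: copy chunk i+1 while computing on chunk i.
dev_buf[0].copy_(host_state[0], non_blocking=True)
for i in range(n_chunks):
    cur, nxt = dev_buf[i % 2], dev_buf[(i + 1) % 2]
    if i + 1 < n_chunks:
        # Don't overwrite the other buffer before prior work on it finishes.
        copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(copy_stream):
            nxt.copy_(host_state[i + 1], non_blocking=True)
    step(cur)                            # runs on the default stream
    torch.cuda.current_stream().wait_stream(copy_stream)
    host_state[i].copy_(cur, non_blocking=True)   # write updated state back
torch.cuda.synchronize()
```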