Distributed systems, databases, networking, operating systems, and performance
Emerging IoT applications are transitioning from battery-powered to grid-powered nodes. DRP, a contention-based data dissemination protocol, was developed for these applications. Traditional contention-based protocols resolve collisions through control packet exchanges, significantly reducing goodput. DRP mitigates this issue by employing a distributed delay timer mechanism that assigns transmission-start delays based on the average link quality between a sender and its children, prioritizing highly connected nodes for early transmission. However, our in-field experiments reveal that DRP is unable to accommodate real-world link quality fluctuations, leading to overlapping transmissions from multiple senders. This overlap triggers CSMA's random back-off delays, ultimately degrading the goodput performance. To address these shortcomings, we first conduct a theoretical analysis that characterizes the design requirements induced by real-world link quality fluctuations and DRP's passive acknowledgments. Guided by this analysis, we design EDRP, which integrates two novel components: (i) Link-Quality Aware CSMA (LQ-CSMA) and (ii) a Machine Learning-based Block Size Selection (ML-BSS) algorithm for rateless codes. LQ-CSMA dynamically restricts the back-off delay range based on real-time link quality estimates, ensuring that nodes with stronger connectivity experience shorter delays. ML-BSS algorithm predicts future link quality conditions and optimally adjusts the block size for rateless coding, reducing overhead and enhancing goodput. In-field evaluations of EDRP demonstrate an average goodput improvement of 39.43\% than the competing protocols.
2602.17610Driven by scientific and industry ambition, HPC and AI applications such as operational Numerical Weather Prediction (NWP) require processing and storing ever-increasing data volumes as fast as possible. Whilst POSIX distributed file systems and NVMe SSDs are currently a common HPC storage configuration providing I/O to applications, new storage solutions have proliferated or gained traction over the last decade with potential to address performance limitations POSIX file systems manifest at scale for certain I/O workloads. This work has primarily aimed to assess the suitability and performance of two object storage systems -namely DAOS and Ceph- for the ECMWF's operational NWP as well as for HPC and AI applications in general. New software-level adapters have been developed which enable the ECMWF's NWP to leverage these systems, and extensive I/O benchmarking has been conducted on a few computer systems, comparing the performance delivered by the evaluated object stores to that of equivalent Lustre file system deployments on the same hardware. Challenges of porting to object storage and its benefits with respect to the traditional POSIX I/O approach have been discussed and, where possible, domain-agnostic performance analysis has been conducted, leading to insight also of relevance to I/O practitioners and the broader HPC community. DAOS and Ceph have both demonstrated excellent performance, but DAOS stood out relative to Ceph and Lustre, providing superior scalability and flexibility for applications to perform I/O at scale as desired. This sets a promising outlook for DAOS and object storage, which might see greater adoption at HPC centres in the years to come, although not necessarily implying a shift away from POSIX-like I/O.
Error-bounded lossy compression is essential for managing the massive data volumes produced by large-scale HPC simulations. While state-of-the-art compressors such as SZ and ZFP provide strong numerical error guarantees, they often fail to preserve topological structures (example, minima, maxima, and saddle points) that are critical for scientific analysis. Existing topology-aware compressors address this limitation but incur substantial computational overhead. We present TopoSZp, a lightweight, topology-aware, error-controlled lossy compressor that preserves critical points and their relationships while maintaining high compression and decompression performance. Built on the high-throughput SZp compressor, TopoSZp integrates efficient critical point detection, local ordering preservation, and targeted saddle point refinement, all within a relaxed but strictly enforced error bound. Experimental results on real-world scientific datasets show that TopoSZp achieves 3 to 100 times fewer non-preserved critical points, introduces no false positives or incorrect critical point types, and delivers 100 to 10000 times faster compression and 10 to 500 times faster decompression compared to existing topology-aware compressors, while maintaining competitive compression ratios.
2602.17541We study the self-stabilizing leader election problem in anonymous $n$-nodes networks. Achieving self-stabilization with low space memory complexity is particularly challenging, and designing space-optimal leader election algorithms remains an open problem for general graphs. In deterministic settings, it is known that $Ω(\log \log n)$ bits of memory per node are necessary [Blin et al., Disc. Math. \& Theor. Comput. Sci., 2023], while in probabilistic settings the same lower bound holds for some values of $n$, but only for an unfair scheduler [Beauquier et al., PODC 1999]. Several deterministic and probabilistic protocols have been proposed in models ranging from the state model to the population protocols. However, to the best of our knowledge, existing solutions either require $Ω(\log n)$ bits of memory per node for general worst case graphs, or achieve low state complexity only under restricted network topologies such as rings, trees, or bounded-degree graphs. In this paper, we present a probabilistic self-stabilizing leader election algorithm for arbitrary anonymous networks that uses $O(\log \log n)$ bits of memory per node. Our algorithm operates in the state model under a synchronous scheduler and assumes knowledge of a global parameter $N = Θ(\log n)$. We show that, under our protocol, the system converges almost surely to a stable configuration with a unique leader and stabilizes within $O(\mathrm{poly}(n))$ rounds with high probability. To achieve $O(\log \log n)$ bits of memory, our algorithm keeps transmitting information after convergence, i.e. it does not verify the silence property. Moreover, like most works in the field, our algorithm does not provide explicit termination detection (i.e., nodes do not detect when the algorithm has converged).
High Altitude Platforms (HAPs) are a major advancement in non-terrestrial networks, offering broad coverage and unique capabilities. They form a vital link between satellite systems and terrestrial networks and play a key role in next-generation communication technologies. This study reviews HAP network applications, focusing on advanced airborne communications, integrated sensing, and airborne informatics. Our survey assesses the current state of HAP-centric applications by examining data processing, network performance, computational and storage requirements, economic feasibility, and regulatory challenges. The analysis highlights the evolving role of HAPs in global communication and identifies future research directions to support their deployment.
Machine learning training places immense demands on cluster networks, motivating specialized architectures and co-design with parallelization strategies. Recent designs incorporating optical circuit switches (OCSes) are promising, offering improved cost, power efficiency, and long-term bandwidth scaling than packet switches. However, most existing approaches rely on costly high-radix OCSes and/or combine them with packet switches to achieve competitive performance at scale. Unfortunately, high-radix OCSes are both expensive and slow to reconfigure, limiting both scalability and performance. We propose Arrays of Cheap Optical Switches (ACOS), which bring application co-design directly to the structure of the reconfigurable fabric. Using low-radix OCSes as building blocks, ACOS supports the forms of reconfiguration needed in training clusters including topology selection, workload adaptation, and failure resilience. The cost of ACOS scales with supported topologies and adaptations rather than with port count, breaking past the scalability barriers of current specialized ML networks. We show through simulation that ACOS-based deployments match the performance of fully provisioned packet-switched networks when training state-of-the-art LLMs at scale, while delivering significant cost savings using existing off-the-shelf OCSes, with strong bandwidth scaling and higher cost savings in the future.
Unmanned Aerial Vehicle (UAV)-assisted networks are increasingly foreseen as a promising approach for emergency response, providing rapid, flexible, and resilient communications in environments where terrestrial infrastructure is degraded or unavailable. In such scenarios, voice radio communications remain essential for first responders due to their robustness; however, their unstructured nature prevents direct integration with automated UAV-assisted network management. This paper proposes SIREN, an AI-driven framework that enables voice-driven perception for UAV-assisted networks. By integrating Automatic Speech Recognition (ASR) with Large Language Model (LLM)-based semantic extraction and Natural Language Processing (NLP) validation, SIREN converts emergency voice traffic into structured, machine-readable information, including responding units, location references, emergency severity, and Quality-of-Service (QoS) requirements. SIREN is evaluated using synthetic emergency scenarios with controlled variations in language, speaker count, background noise, and message complexity. The results demonstrate robust transcription and reliable semantic extraction across diverse operating conditions, while highlighting speaker diarization and geographic ambiguity as the main limiting factors. These findings establish the feasibility of voice-driven situational awareness for UAV-assisted networks and show a practical foundation for human-in-the-loop decision support and adaptive network management in emergency response operations.
Connected and Autonomous Vehicles (CAVs) continue to evolve rapidly, and system latency remains one of their most critical performance parameters, particularly when vehicles are operated remotely. Existing latency-assessment methodologies focus predominantly on Glass-to-Glass (G2G) latency, defined as the delay between an event occurring in the operational environment, its capture by a camera, and its subsequent display to the remote operator. However, G2G latency accounts for only one component of the total delay experienced by the driver. The complementary component, Motion-to-Motion (M2M) latency, represents the delay between the initiation of a control input by the remote driver and the corresponding physical actuation by the vehicle. Together, M2M and G2G constitute the overall End-to-End (E2E) latency. This paper introduces a measurement framework capable of quantifying M2M, G2G, and E2E latencies using gyroscopes, a phototransistor, and two GPS-synchronized Raspberry Pi 5 units. The system employs low-pass filtering and threshold-based detection to identify steering-wheel motion on both the remote operator and vehicle sides. An interrupt is generated when the phototransistor detects the activation of an LED positioned within the camera's Field Of View (FOV). Initial measurements obtained from our teleoperated prototype vehicle over commercial 4G and 5G networks indicate an average E2E latency of approximately 500 ms (measurement precision +/- 4 ms). The M2M latency contributes up to 60% of this value.
Parquet is the de facto columnar file format in modern analytical systems, yet its configuration guidelines have largely been shaped by CPU-centric execution models. As GPU-accelerated data processing becomes increasingly prevalent, Parquet files generated with CPU-oriented defaults can severely underutilize GPU parallelism, turning GPU scans into a performance bottleneck. In this work, we systematically study how Parquet configurations affect GPU scan performance. We show that Parquet's poor GPU performance is not inherent to the format itself but rather a consequence of suboptimal configuration choices. By applying GPU-aware configurations, we increase effective read bandwidth up to 125 GB/s without modifying the Parquet specification.
Optimizing resource utilization in high-performance computing (HPC) clusters is essential for maximizing both system efficiency and user satisfaction. However, traditional rigid job scheduling often results in underutilized resources and increased job waiting times. This work evaluates the benefits of resource elasticity, where the job scheduler dynamically adjusts the resource allocation of malleable jobs at runtime. Using real workload traces from the Cori, Eagle, and Theta supercomputers, we simulate varying proportions (0-100%) of malleable jobs with the ElastiSim software. We evaluate five job scheduling strategies, including a novel one that maintains malleable jobs at their preferred resource allocation when possible. Results show that, compared to fully rigid workloads, malleable jobs yield significant improvements across all key metrics. Considering the best-performing scheduling strategy for each supercomputer, job turnaround times decrease by 37-67%, job makespan by 16-65%, job wait times by 73-99%, and node utilization improves by 5-52%. Although improvements vary, gains remain substantial even at 20% malleable jobs. This work highlights important correlations between workload characteristics (e.g., job runtimes and node requirements), malleability proportions, and scheduling strategies. These findings confirm the potential of malleability to address inefficiencies in current HPC practices and demonstrate that even limited adoption can provide substantial advantages, encouraging its integration into HPC resource management.
Processing sensory data close to the data source, often involving Edge devices, promises low latency for pervasive applications, like smart cities. This commonly involves a multitude of processing services, executed with limited resources; this setup faces three problems: first, the application demand and the resource availability fluctuate, so the service execution must scale dynamically to sustain processing requirements (e.g., latency); second, each service permits different actions to adjust its operation, so they require individual scaling policies; third, without a higher-level mediator, services would cannibalize any resources of services co-located on the same device. This demo first presents a platform for context-aware autoscaling of stream processing services that allows developers to monitor and adjust the service execution across multiple service-specific parameters. We then connect a scaling agent to these interfaces that gradually builds an understanding of the processing environment by exploring each service's action space; the agent then optimizes the service execution according to this knowledge. Participants can revisit the demo contents as video summary and introductory poster, or build a custom agent by extending the artifact repository.
AllReduce is a fundamental collective operation in distributed computing and a key performance bottleneck for large-scale training and inference. Its completion time is determined by the number of communication steps, which dominates latency-sensitive workloads, and the communication distance affecting both latency- and bandwidth-bound regimes. Direct-connect topologies, such as torus networks used in Google's TPUv4, are particularly prone to large communication distances due to limited bisection bandwidth. Latency-optimal algorithms such as Bruck's complete AllReduce in $\log_3 n$ steps on a bidirectional ring, but incur large communication distances that result in substantial congestion. In contrast, recent approaches such as Swing reduce communication distance and congestion, but are inherently required to perform $\log_2 n$ steps to complete AllReduce, sacrificing latency-optimality. In this paper, we present Trivance, a novel AllReduce algorithm that completes within $\log_3 n$ steps, while reducing congestion compared to Bruck's algorithm by a factor of three and preserving bandwidth-optimality. Trivance exploits both transmission ports of a bidirectional ring within each step to triple the communication distance along both directions simultaneously. Furthermore, by performing joint reductions, Trivance improves both the number of steps and network congestion. We further show that Trivance extends naturally to multidimensional torus networks, retaining its latency advantage while achieving performance comparable to bandwidth-optimal algorithms for large messages. Our empirical evaluation shows that Trivance improves state-of-the-art approaches by 5-30% for message sizes up to 8\,MiB, in high-bandwidth settings up to 32MiB and for 3D tori up to 128MiB. Throughout the evaluation, Trivance remains the best-performing latency-optimal algorithm.
In this work, we study a hierarchical non-terrestrial network as an edge-cloud platform for remote computing of tasks generated by remote ad-hoc healthcare facility deployments, or internet of medical things (IoMT) devices. We consider a high altitude platform station (HAPS) to provide local multiaccess edge server (MEC) services to a set of remote ground medical devices, and a low-earth orbit (LEO) satellite, serving as a bridge to a remote cloud computing server through a ground gateway (GW), providing a large amount of computing resources to the HAPS. In this hierarchical system, the HAPS and the cloud server charges the ground users and the HAPS for the use of the spectrum and the computing of their tasks respectively. Each tier seeks to maximize their own utility in a selfish manner. To encourage the prompt computation of the tasks, a local delay cost is assumed. We formulate the optimal per-task cost at each tier that influences the corresponding offloading policies, and find the corresponding optimal bandwidth allocation.
Reconfigurable Intelligent Surfaces (RIS) enable dynamic electromagnetic control for 6G networks, but existing control schemes lack responsiveness to fast-varying network conditions, limiting their applicability for ultra-reliable low latency communications. This work addresses uplink delay minimization in multi-RIS scenarios with heterogeneous per-user latency and reliability demands. We propose Delay-Aware RIS Orchestrator (DARIO), an O-RAN-compliant framework that dynamically assigns RIS devices to users within short time windows, adapting to traffic fluctuations to meet per-user delay and reliability targets. DARIO relies on a novel Stochastic Network Calculus (SNC) model to analytically estimate the delay bound for each possible user-RIS assignment under specific traffic and service dynamics. These estimations are used by DARIO to formulate a Nonlinear Integer Program (NIP), for which an online heuristic provides near-optimal performance with low computational overhead. Extensive evaluations with simulations and real traffic traces show consistent delay reductions up to 95.7% under high load or RIS availability.
Approximate $k$ nearest neighbor (AKNN) search in high-dimensional space is a foundational problem in vector databases with widespread applications. Among the numerous AKNN indexes, Proximity Graph-based indexes achieve state-of-the-art search efficiency across various benchmarks. However, their extensive distance computations of high-dimensional vectors lead to slow construction and substantial memory overhead. The limited memory capacity often prevents building the entire index at once when handling large-scale datasets. A common practice is to build multiple sub-indexes separately. However, directly searching on these separated indexes severely compromises search efficiency, as queries cannot leverage cross-graph connections. Therefore, efficient graph index merging is crucial for multi-index searching. In this paper, we focus on efficient two-index merging and the merge order of multiple indexes for AKNN search. To achieve this, we propose a reverse neighbor sliding merge (RNSM) that exploits structural information to boost merging efficiency. We further investigate merge order selection (MOS) to reduce the merging cost by eliminating redundant merge operations. Experiments show that our approach yields up to a 5.48$\times$ speedup over existing index merge methods and 9.92$\times$ speedup over index reconstruction, while maintaining expected superior search performance. Moreover, our method scales efficiently to 100 million vectors with 50 partitions, maintaining consistent speedups.
Independent, street address-level broadband data is essential for evaluating Internet infrastructure investments, such as the $42B Broadband Equity, Access, and Deployment (BEAD) program. Evaluating these investments requires longitudinal visibility into broadband availability, quality, and affordability, including data on pre-disbursement baselines and changes in providers' advertised plans. While such data can be obtained through Internet Service Provider (ISP) web interfaces, these workloads impose three fundamental system requirements: robustness to frequent interface evolution, extensibility across hundreds of providers, and low technical overhead for non-expert users. Existing systems fail to meet these three essential requirements. We present BQT+, a broadband plan measurement framework that replaces monolithic workflows with declarative state/action specifications. BQT+ models querying intent as an interaction state space, formalized as an abstract nondeterministic finite automaton (NFA), and selects execution paths at runtime to accommodate alternative interaction flows and localized interface changes. We show that BQT+ sustains longitudinal monitoring of 64 ISPs, supporting querying for over 100 ISPs. We apply it to two policy studies: constructing a BEAD pre-disbursement baseline and benchmarking broadband affordability across over 124,000 addresses in four states.
The advent of 5G and the emergence of 6G networks demand unprecedented flexibility and efficiency in Radio Access Network (RAN) resource management to satisfy diverse service-level agreements (SLAs). Existing RAN slicing frameworks predominantly rely on per-slice resource reservation, which ensures performance isolation but leads to inefficient utilization, particularly under bursty traffic. We introduce HyRA, a hybrid resource allocation framework for RAN slicing that combines dedicated per-slice allocations with shared resource pooling across slices. HyRA preserves performance isolation while improving resource efficiency by leveraging multiplexing gains in bursty traffic conditions. We formulate this design as a bi-level stochastic optimization problem, where the outer loop determines the dedicated and shared resource budgets and the inner loop performs per-UE scheduling under a novel water-filling approach. By using the sample-average approximation, the Karush-Kuhn-Tucker (KKT) conditions of the inner loop, and Big-M encoding, we transform the problem into a tractable mixed-integer program that standard optimization solvers can solve. Extensive simulations under diverse demand patterns, SLA configurations, and traffic burstiness show that HyRA achieves up to 50-75% spectrum savings compared to dedicated-only and shared-only baselines. These results highlight HyRA as a viable approach for resource-efficient, SLA-compliant RAN slicing in future mobile networks.
Large Language Models (LLMs) have demonstrated remarkable effectiveness in adapting to downstream tasks through fine-tuning. Federated Learning (FL) extends this capability by enabling collaborative fine-tuning across distributed clients using Low-Rank Adaptation (LoRA), while preserving data privacy by avoiding raw data sharing. However, practical deployments face challenges when clients have heterogeneous resources and thus adopt different LoRA ranks, leading to substantial initialization and aggregation noise that undermines performance. To address these challenges, we propose Fed-PLoRA, a novel lightweight heterogeneous federated fine-tuning (FFT) framework. Fed-PLoRA introduces Parallel One-Rank Adaptation (PLoRA), a new LoRA variant that replaces the classic multi-rank LoRA module with multiple parallel one-rank modules, and a novel Select-N-Fold strategy that folds untrained PLoRA modules into the pre-trained weights before local training, thereby accommodating heterogeneous client resources. We provide a unified analysis of initialization and aggregation noise of Fed-PLoRA and demonstrate how it addresses the limitations of state-of-the-art methods. Extensive experiments on diverse LLM fine-tuning tasks demonstrate that Fed-PLoRA consistently outperforms existing methods in both accuracy and efficiency. The code is available at https://github.com/TNI-playground/Fed-PLoRA.
In the context of asynchronous concurrent shared-memory systems, a snapshot algorithm allows failure-prone processes to concurrently and atomically write on the entries of a shared array MEM , and also atomically read the whole array. Recently, Read-Modify-Writable (RMWable) snapshot was proposed, a variant of snapshot that allows processes to perform operations more complex than just read and write, specifically, each entry MEM[k] is an arbitrary readable object. The known RMWable snapshot algorithms heavily rely on powerful low-level operations such as compare&swap or load-link/store-conditional to correctly produce snapshots of MEM. Following the large body of research devoted to understand the limits of what can be solved using the simple read/write low-level operations, which are known to be strictly weaker than compare&swap and load-link/store-conditional, we explore if RMWable snapshots are possible using only read/write operations. We present two read/write RMWable snapshot algorithms, the first one in the standard concurrent shared-memory model where the number of processes n is finite and known in advance, and the second one in a variant of the standard model with unbounded concurrency, where there are infinitely many processes, but at any moment only finitely many processes participate in an execution.
The deployment of deep learning inference in production environments continues to grow, where throughput, latency, and hardware efficiency are critical. Although specialized accelerators are increasingly adopted, many inference workloads still run on CPU-only systems, particularly in legacy data centers and cost-sensitive environments. This study investigates the scalability limits of CPU-based inference for convolutional neural networks by benchmarking ResNet models across varying batch sizes on two hardware tiers: a legacy Intel Xeon E5-2403 v2 processor and a modern Intel Xeon 6 "Granite Rapids" platform. Results show that legacy CPUs quickly reach throughput saturation, with limited scaling beyond small batch sizes due to instruction-level and memory constraints. In contrast, the Granite Rapids system leverages Intel Advanced Matrix Extensions (AMX) to achieve substantially higher throughput. However, oversubscription beyond physical core limits introduces execution contention and tail-latency amplification, revealing a performance degradation regime in modern architectures. We introduce GDEV-AI, a reproducible benchmarking framework for analyzing scalability behavior and architectural saturation in CPU-based inference. By establishing a vendor-neutral baseline, this work provides empirical insight into performance bottlenecks and informs capacity planning in heterogeneous data center environments.