Systems & Networking

Distributed systems, databases, networking, operating systems, and performance

Includes:Distributed, Parallel, and Cluster Computing(cs.DC)Networking and Internet Architecture(cs.NI)Operating Systems(cs.OS)Performance(cs.PF)Databases(cs.DB)

Systems & Networking

Distributed systems, databases, networking, operating systems, and performance

Includes:Distributed, Parallel, and Cluster Computing(cs.DC)Networking and Internet Architecture(cs.NI)Operating Systems(cs.OS)Performance(cs.PF)Databases(cs.DB)

Trending in Systems & Networking

DataJoint 2.0: A Computational Substrate for Agentic Scientific Workflows

Operational rigor determines whether human-agent collaboration succeeds or fails. Scientific data pipelines need the equivalent of DevOps -- SciOps -- yet common approaches fragment provenance across disconnected systems without transactional guarantees. DataJoint 2.0 addresses this gap through the relational workflow model: tables represent workflow steps, rows represent artifacts, foreign keys prescribe execution order. The schema specifies not only what data exists but how it is derived -- a single formal system where data structure, computational dependencies, and integrity constraints are all queryable, enforceable, and machine-readable. Four technical innovations extend this foundation: object-augmented schemas integrating relational metadata with scalable object storage, semantic matching using attribute lineage to prevent erroneous joins, an extensible type system for domain-specific formats, and distributed job coordination designed for composability with external orchestration. By unifying data structure, data, and computational transformations, DataJoint creates a substrate for SciOps where agents can participate in scientific workflows without risking data corruption.

2602.16585

Feb 2026Databases

ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System

Large language models(LLMs) are now used to power complex multi-turn agentic workflows. Existing systems run agentic inference by loosely assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although agentic workflows involve multiple LLM and tool requests, these systems schedule and allocate resources separately on a per-request basis, without end-to-end knowledge of the workflow. This leads to sub-optimal management of KV cache and tool execution environments. To address the challenges, we propose ThunderAgent, a fast, simple, and program-aware agentic inference system. We first abstract agentic workflows as LLM Programs, enabling a unified view of heterogeneous resources, including KV caches, system states, and external tool assets such as disk memory and network ports. Built upon this abstraction, ThunderAgent introduces a program-aware scheduler and a tool resource manager designed to maximize KV cache hit rates, mitigate memory imbalances, and enable asynchronous environment preparation. Evaluations across coding, routing, and scientific discovery agents demonstrate that ThunderAgent achieves 1.5-3.6x throughput improvements in serving, 1.8-3.9x in RL rollout, and up to 4.2x disk memory savings compared to state-of-the-art inference systems. To facilitate reproducibility and support future development, we open-source the system implementations of the whole ThunderAgent at: https://github.com/Agentic-Kinetics/ThunderAgent.

2602.13692

Feb 2026Operating Systems

Fine-Tuning GPT-5 for GPU Kernel Generation

Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora's environment and tools for reinforcement learning finetuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.

2602.11000

Feb 2026Distributed, Parallel, and Cluster Computing

Benchmarking Large Language Models for Knowledge Graph Validation

Knowledge Graphs (KGs) store structured factual knowledge by linking entities through relationships, crucial for many applications. These applications depend on the KG's factual accuracy, so verifying facts is essential, yet challenging. Expert manual verification is ideal but impractical on a large scale. Automated methods show promise but are not ready for real-world KGs. Large Language Models (LLMs) offer potential with their semantic understanding and knowledge access, yet their suitability and effectiveness for KG fact validation remain largely unexplored. In this paper, we introduce FactCheck, a benchmark designed to evaluate LLMs for KG fact validation across three key dimensions: (1) LLMs internal knowledge; (2) external evidence via Retrieval-Augmented Generation (RAG); and (3) aggregated knowledge employing a multi-model consensus strategy. We evaluated open-source and commercial LLMs on three diverse real-world KGs. FactCheck also includes a RAG dataset with 2+ million documents tailored for KG fact validation. Additionally, we offer an interactive exploration platform for analyzing verification decisions. The experimental analyses demonstrate that while LLMs yield promising results, they are still not sufficiently stable and reliable to be used in real-world KG validation scenarios. Integrating external evidence through RAG methods yields fluctuating performance, providing inconsistent improvements over more streamlined approaches -- at higher computational costs. Similarly, strategies based on multi-model consensus do not consistently outperform individual models, underscoring the lack of a one-fits-all solution. These findings further emphasize the need for a benchmark like FactCheck to systematically evaluate and drive progress on this difficult yet crucial task.

2602.10748

Feb 2026Databases

vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models

Any-to-any multimodal models that jointly handle text, images, video, and audio represent a significant advance in multimodal AI. However, their complex architectures (typically combining multiple autoregressive LLMs, diffusion transformers, and other specialized components) pose substantial challenges for efficient model serving. Existing serving systems are mainly tailored to a single paradigm, such as autoregressive LLMs for text generation or diffusion transformers for visual generation. They lack support for any-to-any pipelines that involve multiple interconnected model components. As a result, developers must manually handle cross-stage interactions, leading to huge performance degradation. We present vLLM-Omni, a fully disaggregated serving system for any-to-any models. vLLM-Omni features a novel stage abstraction that enables users to decompose complex any-to-any architectures into interconnected stages represented as a graph, and a disaggregated stage execution backend that optimizes resource utilization and throughput across stages. Each stage is independently served by an LLM or diffusion engine with per-stage request batching, flexible GPU allocation, and unified inter-stage connectors for data routing. Experimental results demonstrate that vLLM-Omni reduces job completion time (JCT) by up to 91.4% compared to baseline methods. The code is public available at https://github.com/vllm-project/vllm-omni.

2602.02204

Feb 2026Distributed, Parallel, and Cluster Computing

The Stretto Execution Engine for LLM-Augmented Data Systems

LLM-augmented data systems enable semantic querying over structured and unstructured data, but executing queries with LLM-powered operators introduces a fundamental runtime--accuracy trade-off. In this paper, we present Stretto, a new execution engine that provides end-to-end query guarantees while efficiently navigating this trade-off in a holistic manner. For this, Stretto formulates query planning as a constrained optimization problem and uses a gradient-based optimizer to jointly select operator implementations and allocate error budgets across pipelines. Moreover, to enable fine-grained execution choices, Stretto introduces a novel idea on how KV-caching can be used to realize a spectrum of different physical operators that transform a sparse design space into a dense continuum of runtime--accuracy trade-offs. Experiments show that Stretto outperforms state-of-the-art systems while consistently meeting quality guarantees.

2602.04430

Feb 2026Databases

PROBE: Co-Balancing Computation and Communication in MoE Inference via Real-Time Predictive Prefetching

Mixture-of-Experts models have become a dominant architecture for scaling Large Language Models by activating only a sparse subset of experts per token. However, latency-critical MoE inference faces a fundamental tension: while expert parallelism improves memory efficiency, it also amplifies execution stragglers. In real-world serving, continuous batching and diverse concurrent requests induce rapid semantic shifts, causing expert hotspots to migrate abruptly across GPUs and triggering the 'double penalty' of coupled computational skew and network congestion. We propose PROBE, an inference system that co-balances computation and communication in real time. PROBE introduces Continuous Lookahead Pipelining, which proactively predicts, plans, and prefetches for upcoming layers while keeping all control overheads off the critical path. PROBE consists of: (1) a Gate-Initialized Lookahead Predictor that distills the target router to forecast next-layer expert activation with high fidelity; (2) a Hardware-Aware Balance Planning solver that jointly optimizes dynamic expert replication and token assignment under strict hiding-window constraints; and (3) a Phase-Locked Co-Scheduling policy that uses split-phase transmission to hide bandwidth-intensive expert transfers behind computation without contending with All-to-All collectives. Experiments show that PROBE reduces prefill latency by up to 1.32X and improves decoding throughput by up to 1.26X over state-of-the-art baselines, especially under extreme workload volatility.

2602.00509

Jan 2026Distributed, Parallel, and Cluster Computing

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving.

2601.20309

Jan 2026Distributed, Parallel, and Cluster Computing

Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers

Scaling modern deep learning workloads demands coordinated placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators. We present Axe Layout, a hardware-aware abstraction that maps logical tensor coordinates to a multi-axis physical space via named axes. Axe unifies tiling, sharding, replication, and offsets across inter-device distribution and on-device layouts, enabling collective primitives to be expressed consistently from device meshes to threads. Building on Axe, we design a multi-granularity, distribution-aware DSL and compiler that composes thread-local control with collective operators in a single kernel. Experiments show that our unified approach can bring performance close to hand-tuned kernels on across latest GPU devices and multi-device environments and accelerator backends.

2601.19092

Jan 2026Distributed, Parallel, and Cluster Computing

MegaFlow: Large-Scale Distributed Orchestration System for the Agentic Era

The rapid development of interactive and autonomous AI systems signals our entry into the agentic era. Training and evaluating agents on complex agentic tasks such as software engineering and computer use requires not only efficient model computation but also sophisticated infrastructure capable of coordinating vast agent-environment interactions. However, no open-source infrastructure can effectively support large-scale training and evaluation on such complex agentic tasks. To address this challenge, we present MegaFlow, a large-scale distributed orchestration system that enables efficient scheduling, resource allocation, and fine-grained task management for agent-environment workloads. MegaFlow abstracts agent training infrastructure into three independent services (Model Service, Agent Service, and Environment Service) that interact through unified interfaces, enabling independent scaling and flexible resource allocation across diverse agent-environment configurations. In our agent training deployments, MegaFlow successfully orchestrates tens of thousands of concurrent agent tasks while maintaining high system stability and achieving efficient resource utilization. By enabling such large-scale agent training, MegaFlow addresses a critical infrastructure gap in the emerging agentic AI landscape.

2601.07526

Jan 2026Distributed, Parallel, and Cluster Computing

OrchestrRL: Dynamic Compute and Network Orchestration for Disaggregated RL

Post-training with reinforcement learning (RL) has greatly enhanced the capabilities of large language models. Disaggregating the generation and training stages in RL into a parallel, asynchronous pipeline offers the potential for flexible scaling and improved throughput. However, it still faces two critical challenges. First, the generation stage often becomes a bottleneck due to dynamic workload shifts and severe execution imbalances. Second, the decoupled stages result in diverse and dynamic network traffic patterns that overwhelm conventional network fabrics. This paper introduces OrchestrRL, an orchestration framework that dynamically manages compute and network rhythms in disaggregated RL. To improve generation efficiency, OrchestrRL employs an adaptive compute scheduler that dynamically adjusts parallelism to match workload characteristics within and across generation steps. This accelerates execution while continuously rebalancing requests to mitigate stragglers. To address the dynamic network demands inherent in disaggregated RL -- further intensified by parallelism switching -- we co-design RFabric, a reconfigurable hybrid optical-electrical fabric. RFabric leverages optical circuit switches at selected network tiers to reconfigure the topology in real time, enabling workload-aware circuits for (i) layer-wise collective communication during training iterations, (ii) generation under different parallelism configurations, and (iii) periodic inter-cluster weight synchronization. We evaluate OrchestrRL on a physical testbed with 48 H800 GPUs, demonstrating up to a 1.40x throughput improvement. Furthermore, we develop RLSim, a high-fidelity simulator, to evaluate RFabric at scale. Our results show that RFabric achieves superior performance-cost efficiency compared to static Fat-Tree networks, establishing it as a highly effective solution for large-scale RL workloads.

2601.01209

Jan 2026Distributed, Parallel, and Cluster Computing

MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability

We present MAESTRO, an evaluation suite for the testing, reliability, and observability of LLM-based MAS. MAESTRO standardizes MAS configuration and execution through a unified interface, supports integrating both native and third-party MAS via a repository of examples and lightweight adapters, and exports framework-agnostic execution traces together with system-level signals (e.g., latency, cost, and failures). We instantiate MAESTRO with 12 representative MAS spanning popular agentic frameworks and interaction patterns, and conduct controlled experiments across repeated runs, backend models, and tool configurations. Our case studies show that MAS executions can be structurally stable yet temporally variable, leading to substantial run-to-run variance in performance and reliability. We further find that MAS architecture is the dominant driver of resource profiles, reproducibility, and cost-latency-accuracy trade-off, often outweighing changes in backend models or tool settings. Overall, MAESTRO enables systematic evaluation and provides empirical guidance for designing and optimizing agentic systems.

2601.00481

Jan 2026Networking and Internet Architecture

RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure

Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning. Unlike standard LLM post-training, agentic RL workloads are highly heterogeneous, combining compute-intensive prefill phases, bandwidth-bound decoding, and stateful, CPU-heavy environment simulations. We argue that efficient agentic RL training requires disaggregated infrastructure to leverage specialized, best-fit hardware. However, naive disaggregation introduces substantial synchronization overhead and resource underutilization due to the complex dependencies between stages. We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure. RollArc is built on three core principles: (1) hardware-affinity workload mapping, which routes compute-bound and bandwidth-bound tasks to bestfit GPU devices, (2) fine-grained asynchrony, which manages execution at the trajectory level to mitigate resource bubbles, and (3) statefulness-aware computation, which offloads stateless components (e.g., reward models) to serverless infrastructure for elastic scaling. Our results demonstrate that RollArc effectively improves training throughput and achieves 1.35-2.05$\times$ end-to-end training time reduction compared to monolithic and synchronous baselines. We also evaluate RollArc by training a hundreds-of-billions-parameter MoE model for Qoder product on an Alibaba cluster with more than 3,000 GPUs, further demonstrating RollArc scalability and robustness. The code is available at https://github.com/alibaba/ROLL.

2512.225601

Dec 2025Distributed, Parallel, and Cluster Computing

ECCO: Leveraging Cross-Camera Correlations for Efficient Live Video Continuous Learning

Recent advances in video analytics address real-time data drift by continuously retraining specialized, lightweight DNN models for individual cameras. However, the current practice of retraining a separate model for each camera suffers from high compute and communication costs, making it unscalable. We present ECCO, a new video analytics framework designed for resource-efficient continuous learning. The key insight is that the data drift, which necessitates model retraining, often shows temporal and spatial correlations across nearby cameras. By identifying cameras that experience similar drift and retraining a shared model for them, ECCO can substantially reduce the associated compute and communication costs. Specifically, ECCO introduces: (i) a lightweight grouping algorithm that dynamically forms and updates camera groups; (ii) a GPU allocator that dynamically assigns GPU resources across different groups to improve retraining accuracy and ensure fairness; and (iii) a transmission controller at each camera that configures frame sampling and coordinates bandwidth sharing with other cameras based on its assigned GPU resources. We conducted extensive evaluations on three distinctive datasets for two vision tasks. Compared to leading baselines, ECCO improves retraining accuracy by 6.7%-18.1% using the same compute and communication resources, or supports 3.3 times more concurrent cameras at the same accuracy.

2512.11727

Dec 2025Distributed, Parallel, and Cluster Computing

RDMA Point-to-Point Communication for LLM Systems

Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs to expose a uniform interface. TransferEngine exposes one-sided WriteImm operations with a ImmCounter primitive for completion notification, without ordering assumptions of network transport, transparently managing multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase TransferEngine through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates achieving 1.3 seconds for trillion-parameter models, and (3) MoE dispatch/combine implementation exceeding DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in.

2510.276562

Oct 2025Distributed, Parallel, and Cluster Computing

7,635 papers

EDRP: Enhanced Dynamic Relay Point Protocol for Data Dissemination in Multi-hop Wireless IoT Networks

Emerging IoT applications are transitioning from battery-powered to grid-powered nodes. DRP, a contention-based data dissemination protocol, was developed for these applications. Traditional contention-based protocols resolve collisions through control packet exchanges, significantly reducing goodput. DRP mitigates this issue by employing a distributed delay timer mechanism that assigns transmission-start delays based on the average link quality between a sender and its children, prioritizing highly connected nodes for early transmission. However, our in-field experiments reveal that DRP is unable to accommodate real-world link quality fluctuations, leading to overlapping transmissions from multiple senders. This overlap triggers CSMA's random back-off delays, ultimately degrading the goodput performance. To address these shortcomings, we first conduct a theoretical analysis that characterizes the design requirements induced by real-world link quality fluctuations and DRP's passive acknowledgments. Guided by this analysis, we design EDRP, which integrates two novel components: (i) Link-Quality Aware CSMA (LQ-CSMA) and (ii) a Machine Learning-based Block Size Selection (ML-BSS) algorithm for rateless codes. LQ-CSMA dynamically restricts the back-off delay range based on real-time link quality estimates, ensuring that nodes with stronger connectivity experience shorter delays. ML-BSS algorithm predicts future link quality conditions and optimally adjusts the block size for rateless coding, reducing overhead and enhancing goodput. In-field evaluations of EDRP demonstrate an average goodput improvement of 39.43\% than the competing protocols.

2602.17619Feb 2026

View

2602.17610

Exploring Novel Data Storage Approaches for Large-Scale Numerical Weather Prediction

Driven by scientific and industry ambition, HPC and AI applications such as operational Numerical Weather Prediction (NWP) require processing and storing ever-increasing data volumes as fast as possible. Whilst POSIX distributed file systems and NVMe SSDs are currently a common HPC storage configuration providing I/O to applications, new storage solutions have proliferated or gained traction over the last decade with potential to address performance limitations POSIX file systems manifest at scale for certain I/O workloads. This work has primarily aimed to assess the suitability and performance of two object storage systems -namely DAOS and Ceph- for the ECMWF's operational NWP as well as for HPC and AI applications in general. New software-level adapters have been developed which enable the ECMWF's NWP to leverage these systems, and extensive I/O benchmarking has been conducted on a few computer systems, comparing the performance delivered by the evaluated object stores to that of equivalent Lustre file system deployments on the same hardware. Challenges of porting to object storage and its benefits with respect to the traditional POSIX I/O approach have been discussed and, where possible, domain-agnostic performance analysis has been conducted, leading to insight also of relevance to I/O practitioners and the broader HPC community. DAOS and Ceph have both demonstrated excellent performance, but DAOS stood out relative to Ceph and Lustre, providing superior scalability and flexibility for applications to perform I/O at scale as desired. This sets a promising outlook for DAOS and object storage, which might see greater adoption at HPC centres in the years to come, although not necessarily implying a shift away from POSIX-like I/O.

2602.17610Feb 2026

View

TopoSZp: Lightweight Topology-Aware Error-controlled Compression for Scientific Data

Error-bounded lossy compression is essential for managing the massive data volumes produced by large-scale HPC simulations. While state-of-the-art compressors such as SZ and ZFP provide strong numerical error guarantees, they often fail to preserve topological structures (example, minima, maxima, and saddle points) that are critical for scientific analysis. Existing topology-aware compressors address this limitation but incur substantial computational overhead. We present TopoSZp, a lightweight, topology-aware, error-controlled lossy compressor that preserves critical points and their relationships while maintaining high compression and decompression performance. Built on the high-throughput SZp compressor, TopoSZp integrates efficient critical point detection, local ordering preservation, and targeted saddle point refinement, all within a relaxed but strictly enforced error bound. Experimental results on real-world scientific datasets show that TopoSZp achieves 3 to 100 times fewer non-preserved critical points, introduces no false positives or incorrect critical point types, and delivers 100 to 10000 times faster compression and 10 to 500 times faster decompression compared to existing topology-aware compressors, while maintaining competitive compression ratios.

2602.17552Feb 2026

View

2602.17541

Informative Trains: A Memory-Efficient Journey to a Self-Stabilizing Leader Election Algorithm in Anonymous Graphs

We study the self-stabilizing leader election problem in anonymous $n$-nodes networks. Achieving self-stabilization with low space memory complexity is particularly challenging, and designing space-optimal leader election algorithms remains an open problem for general graphs. In deterministic settings, it is known that $Ω(\log \log n)$ bits of memory per node are necessary [Blin et al., Disc. Math. \& Theor. Comput. Sci., 2023], while in probabilistic settings the same lower bound holds for some values of $n$, but only for an unfair scheduler [Beauquier et al., PODC 1999]. Several deterministic and probabilistic protocols have been proposed in models ranging from the state model to the population protocols. However, to the best of our knowledge, existing solutions either require $Ω(\log n)$ bits of memory per node for general worst case graphs, or achieve low state complexity only under restricted network topologies such as rings, trees, or bounded-degree graphs. In this paper, we present a probabilistic self-stabilizing leader election algorithm for arbitrary anonymous networks that uses $O(\log \log n)$ bits of memory per node. Our algorithm operates in the state model under a synchronous scheduler and assumes knowledge of a global parameter $N = Θ(\log n)$. We show that, under our protocol, the system converges almost surely to a stable configuration with a unique leader and stabilizes within $O(\mathrm{poly}(n))$ rounds with high probability. To achieve $O(\log \log n)$ bits of memory, our algorithm keeps transmitting information after convergence, i.e. it does not verify the silence property. Moreover, like most works in the field, our algorithm does not provide explicit termination detection (i.e., nodes do not detect when the algorithm has converged).

2602.17541Feb 2026

View

HAP Networks for the Future: Applications in Sensing, Computing, and Communication

High Altitude Platforms (HAPs) are a major advancement in non-terrestrial networks, offering broad coverage and unique capabilities. They form a vital link between satellite systems and terrestrial networks and play a key role in next-generation communication technologies. This study reviews HAP network applications, focusing on advanced airborne communications, integrated sensing, and airborne informatics. Our survey assesses the current state of HAP-centric applications by examining data processing, network performance, computational and storage requirements, economic feasibility, and regulatory challenges. The analysis highlights the evolving role of HAPs in global communication and identifies future research directions to support their deployment.

2602.17534Feb 2026

View

ACOS: Arrays of Cheap Optical Switches

Machine learning training places immense demands on cluster networks, motivating specialized architectures and co-design with parallelization strategies. Recent designs incorporating optical circuit switches (OCSes) are promising, offering improved cost, power efficiency, and long-term bandwidth scaling than packet switches. However, most existing approaches rely on costly high-radix OCSes and/or combine them with packet switches to achieve competitive performance at scale. Unfortunately, high-radix OCSes are both expensive and slow to reconfigure, limiting both scalability and performance. We propose Arrays of Cheap Optical Switches (ACOS), which bring application co-design directly to the structure of the reconfigurable fabric. Using low-radix OCSes as building blocks, ACOS supports the forms of reconfiguration needed in training clusters including topology selection, workload adaptation, and failure resilience. The cost of ACOS scales with supported topologies and adaptations rather than with port count, breaking past the scalability barriers of current specialized ML networks. We show through simulation that ACOS-based deployments match the performance of fully provisioned packet-switched networks when training state-of-the-art LLMs at scale, while delivering significant cost savings using existing off-the-shelf OCSes, with strong bandwidth scaling and higher cost savings in the future.

2602.17449Feb 2026

View

Voice-Driven Semantic Perception for UAV-Assisted Emergency Networks

Unmanned Aerial Vehicle (UAV)-assisted networks are increasingly foreseen as a promising approach for emergency response, providing rapid, flexible, and resilient communications in environments where terrestrial infrastructure is degraded or unavailable. In such scenarios, voice radio communications remain essential for first responders due to their robustness; however, their unstructured nature prevents direct integration with automated UAV-assisted network management. This paper proposes SIREN, an AI-driven framework that enables voice-driven perception for UAV-assisted networks. By integrating Automatic Speech Recognition (ASR) with Large Language Model (LLM)-based semantic extraction and Natural Language Processing (NLP) validation, SIREN converts emergency voice traffic into structured, machine-readable information, including responding units, location references, emergency severity, and Quality-of-Service (QoS) requirements. SIREN is evaluated using synthetic emergency scenarios with controlled variations in language, speaker count, background noise, and message complexity. The results demonstrate robust transcription and reliable semantic extraction across diverse operating conditions, while highlighting speaker diarization and geographic ambiguity as the main limiting factors. These findings establish the feasibility of voice-driven situational awareness for UAV-assisted networks and show a practical foundation for human-in-the-loop decision support and adaptive network management in emergency response operations.

2602.17394Feb 2026

View

End-to-End Latency Measurement Methodology for Connected and Autonomous Vehicle Teleoperation

Connected and Autonomous Vehicles (CAVs) continue to evolve rapidly, and system latency remains one of their most critical performance parameters, particularly when vehicles are operated remotely. Existing latency-assessment methodologies focus predominantly on Glass-to-Glass (G2G) latency, defined as the delay between an event occurring in the operational environment, its capture by a camera, and its subsequent display to the remote operator. However, G2G latency accounts for only one component of the total delay experienced by the driver. The complementary component, Motion-to-Motion (M2M) latency, represents the delay between the initiation of a control input by the remote driver and the corresponding physical actuation by the vehicle. Together, M2M and G2G constitute the overall End-to-End (E2E) latency. This paper introduces a measurement framework capable of quantifying M2M, G2G, and E2E latencies using gyroscopes, a phototransistor, and two GPS-synchronized Raspberry Pi 5 units. The system employs low-pass filtering and threshold-based detection to identify steering-wheel motion on both the remote operator and vehicle sides. An interrupt is generated when the phototransistor detects the activation of an LED positioned within the camera's Field Of View (FOV). Initial measurements obtained from our teleoperated prototype vehicle over commercial 4G and 5G networks indicate an average E2E latency of approximately 500 ms (measurement precision +/- 4 ms). The M2M latency contributes up to 60% of this value.

2602.17381Feb 2026

View

Do GPUs Really Need New Tabular File Formats?

Parquet is the de facto columnar file format in modern analytical systems, yet its configuration guidelines have largely been shaped by CPU-centric execution models. As GPU-accelerated data processing becomes increasingly prevalent, Parquet files generated with CPU-oriented defaults can severely underutilize GPU parallelism, turning GPU scans into a performance bottleneck. In this work, we systematically study how Parquet configurations affect GPU scan performance. We show that Parquet's poor GPU performance is not inherent to the format itself but rather a consequence of suboptimal configuration choices. By applying GPU-aware configurations, we increase effective read bandwidth up to 125 GB/s without modifying the Parquet specification.

2602.17335Feb 2026

View

Evaluating Malleable Job Scheduling in HPC Clusters using Real-World Workloads

Optimizing resource utilization in high-performance computing (HPC) clusters is essential for maximizing both system efficiency and user satisfaction. However, traditional rigid job scheduling often results in underutilized resources and increased job waiting times. This work evaluates the benefits of resource elasticity, where the job scheduler dynamically adjusts the resource allocation of malleable jobs at runtime. Using real workload traces from the Cori, Eagle, and Theta supercomputers, we simulate varying proportions (0-100%) of malleable jobs with the ElastiSim software. We evaluate five job scheduling strategies, including a novel one that maintains malleable jobs at their preferred resource allocation when possible. Results show that, compared to fully rigid workloads, malleable jobs yield significant improvements across all key metrics. Considering the best-performing scheduling strategy for each supercomputer, job turnaround times decrease by 37-67%, job makespan by 16-65%, job wait times by 73-99%, and node utilization improves by 5-52%. Although improvements vary, gains remain substantial even at 20% malleable jobs. This work highlights important correlations between workload characteristics (e.g., job runtimes and node requirements), malleability proportions, and scheduling strategies. These findings confirm the potential of malleability to address inefficiencies in current HPC practices and demonstrate that even limited adoption can provide substantial advantages, encouraging its integration into HPC resource management.

2602.17318Feb 2026

View

Visual Insights into Agentic Optimization of Pervasive Stream Processing Services

Processing sensory data close to the data source, often involving Edge devices, promises low latency for pervasive applications, like smart cities. This commonly involves a multitude of processing services, executed with limited resources; this setup faces three problems: first, the application demand and the resource availability fluctuate, so the service execution must scale dynamically to sustain processing requirements (e.g., latency); second, each service permits different actions to adjust its operation, so they require individual scaling policies; third, without a higher-level mediator, services would cannibalize any resources of services co-located on the same device. This demo first presents a platform for context-aware autoscaling of stream processing services that allows developers to monitor and adjust the service execution across multiple service-specific parameters. We then connect a scaling agent to these interfaces that gradually builds an understanding of the processing environment by exploring each service's action space; the agent then optimizes the service execution according to this knowledge. Participants can revisit the demo contents as video summary and introductory poster, or build a custom agent by extending the artifact repository.

2602.17282Feb 2026

View

Trivance: Latency-Optimal AllReduce by Shortcutting Multiport Networks

AllReduce is a fundamental collective operation in distributed computing and a key performance bottleneck for large-scale training and inference. Its completion time is determined by the number of communication steps, which dominates latency-sensitive workloads, and the communication distance affecting both latency- and bandwidth-bound regimes. Direct-connect topologies, such as torus networks used in Google's TPUv4, are particularly prone to large communication distances due to limited bisection bandwidth. Latency-optimal algorithms such as Bruck's complete AllReduce in $\log_3 n$ steps on a bidirectional ring, but incur large communication distances that result in substantial congestion. In contrast, recent approaches such as Swing reduce communication distance and congestion, but are inherently required to perform $\log_2 n$ steps to complete AllReduce, sacrificing latency-optimality. In this paper, we present Trivance, a novel AllReduce algorithm that completes within $\log_3 n$ steps, while reducing congestion compared to Bruck's algorithm by a factor of three and preserving bandwidth-optimality. Trivance exploits both transmission ports of a bidirectional ring within each step to triple the communication distance along both directions simultaneously. Furthermore, by performing joint reductions, Trivance improves both the number of steps and network congestion. We further show that Trivance extends naturally to multidimensional torus networks, retaining its latency advantage while achieving performance comparable to bandwidth-optimal algorithms for large messages. Our empirical evaluation shows that Trivance improves state-of-the-art approaches by 5-30% for message sizes up to 8\,MiB, in high-bandwidth settings up to 32MiB and for 3D tori up to 128MiB. Throughout the evaluation, Trivance remains the best-performing latency-optimal algorithm.

2602.17254Feb 2026

View

Hierarchical Edge-Cloud Task Offloading in NTN for Remote Healthcare

In this work, we study a hierarchical non-terrestrial network as an edge-cloud platform for remote computing of tasks generated by remote ad-hoc healthcare facility deployments, or internet of medical things (IoMT) devices. We consider a high altitude platform station (HAPS) to provide local multiaccess edge server (MEC) services to a set of remote ground medical devices, and a low-earth orbit (LEO) satellite, serving as a bridge to a remote cloud computing server through a ground gateway (GW), providing a large amount of computing resources to the HAPS. In this hierarchical system, the HAPS and the cloud server charges the ground users and the HAPS for the use of the spectrum and the computing of their tasks respectively. Each tier seeks to maximize their own utility in a selfish manner. To encourage the prompt computation of the tasks, a local delay cost is assumed. We formulate the optimal per-task cost at each tier that influences the corresponding offloading policies, and find the corresponding optimal bandwidth allocation.

2602.17209Feb 2026

View

RIS Control through the Lens of Stochastic Network Calculus: An O-RAN Framework for Delay-Sensitive 6G Applications

Reconfigurable Intelligent Surfaces (RIS) enable dynamic electromagnetic control for 6G networks, but existing control schemes lack responsiveness to fast-varying network conditions, limiting their applicability for ultra-reliable low latency communications. This work addresses uplink delay minimization in multi-RIS scenarios with heterogeneous per-user latency and reliability demands. We propose Delay-Aware RIS Orchestrator (DARIO), an O-RAN-compliant framework that dynamically assigns RIS devices to users within short time windows, adapting to traffic fluctuations to meet per-user delay and reliability targets. DARIO relies on a novel Stochastic Network Calculus (SNC) model to analytically estimate the delay bound for each possible user-RIS assignment under specific traffic and service dynamics. These estimations are used by DARIO to formulate a Nonlinear Integer Program (NIP), for which an online heuristic provides near-optimal performance with low computational overhead. Extensive evaluations with simulations and real traffic traces show consistent delay reductions up to 95.7% under high load or RIS availability.

2602.17198Feb 2026

View

Multiple Index Merge for Approximate Nearest Neighbor Search

Approximate $k$ nearest neighbor (AKNN) search in high-dimensional space is a foundational problem in vector databases with widespread applications. Among the numerous AKNN indexes, Proximity Graph-based indexes achieve state-of-the-art search efficiency across various benchmarks. However, their extensive distance computations of high-dimensional vectors lead to slow construction and substantial memory overhead. The limited memory capacity often prevents building the entire index at once when handling large-scale datasets. A common practice is to build multiple sub-indexes separately. However, directly searching on these separated indexes severely compromises search efficiency, as queries cannot leverage cross-graph connections. Therefore, efficient graph index merging is crucial for multi-index searching. In this paper, we focus on efficient two-index merging and the merge order of multiple indexes for AKNN search. To achieve this, we propose a reverse neighbor sliding merge (RNSM) that exploits structural information to boost merging efficiency. We further investigate merge order selection (MOS) to reduce the merging cost by eliminating redundant merge operations. Experiments show that our approach yields up to a 5.48$\times$ speedup over existing index merge methods and 9.92$\times$ speedup over index reconstruction, while maintaining expected superior search performance. Moreover, our method scales efficiently to 100 million vectors with 50 partitions, maintaining consistent speedups.

2602.17099Feb 2026

View

Robust and Extensible Measurement of Broadband Plans with BQT+

Independent, street address-level broadband data is essential for evaluating Internet infrastructure investments, such as the $42B Broadband Equity, Access, and Deployment (BEAD) program. Evaluating these investments requires longitudinal visibility into broadband availability, quality, and affordability, including data on pre-disbursement baselines and changes in providers' advertised plans. While such data can be obtained through Internet Service Provider (ISP) web interfaces, these workloads impose three fundamental system requirements: robustness to frequent interface evolution, extensibility across hundreds of providers, and low technical overhead for non-expert users. Existing systems fail to meet these three essential requirements. We present BQT+, a broadband plan measurement framework that replaces monolithic workflows with declarative state/action specifications. BQT+ models querying intent as an interaction state space, formalized as an abstract nondeterministic finite automaton (NFA), and selects execution paths at runtime to accommodate alternative interaction flows and localized interface changes. We show that BQT+ sustains longitudinal monitoring of 64 ISPs, supporting querying for over 100 ISPs. We apply it to two policy studies: constructing a BEAD pre-disbursement baseline and benchmarking broadband affordability across over 124,000 addresses in four states.

2602.16969Feb 2026

View

HyRA: A Hybrid Resource Allocation Framework for RAN Slicing

The advent of 5G and the emergence of 6G networks demand unprecedented flexibility and efficiency in Radio Access Network (RAN) resource management to satisfy diverse service-level agreements (SLAs). Existing RAN slicing frameworks predominantly rely on per-slice resource reservation, which ensures performance isolation but leads to inefficient utilization, particularly under bursty traffic. We introduce HyRA, a hybrid resource allocation framework for RAN slicing that combines dedicated per-slice allocations with shared resource pooling across slices. HyRA preserves performance isolation while improving resource efficiency by leveraging multiplexing gains in bursty traffic conditions. We formulate this design as a bi-level stochastic optimization problem, where the outer loop determines the dedicated and shared resource budgets and the inner loop performs per-UE scheduling under a novel water-filling approach. By using the sample-average approximation, the Karush-Kuhn-Tucker (KKT) conditions of the inner loop, and Big-M encoding, we transform the problem into a tractable mixed-integer program that standard optimization solvers can solve. Extensive simulations under diverse demand patterns, SLA configurations, and traffic burstiness show that HyRA achieves up to 50-75% spectrum savings compared to dedicated-only and shared-only baselines. These results highlight HyRA as a viable approach for resource-efficient, SLA-compliant RAN slicing in future mobile networks.

2602.16952Feb 2026

View

Heterogeneous Federated Fine-Tuning with Parallel One-Rank Adaptation

Large Language Models (LLMs) have demonstrated remarkable effectiveness in adapting to downstream tasks through fine-tuning. Federated Learning (FL) extends this capability by enabling collaborative fine-tuning across distributed clients using Low-Rank Adaptation (LoRA), while preserving data privacy by avoiding raw data sharing. However, practical deployments face challenges when clients have heterogeneous resources and thus adopt different LoRA ranks, leading to substantial initialization and aggregation noise that undermines performance. To address these challenges, we propose Fed-PLoRA, a novel lightweight heterogeneous federated fine-tuning (FFT) framework. Fed-PLoRA introduces Parallel One-Rank Adaptation (PLoRA), a new LoRA variant that replaces the classic multi-rank LoRA module with multiple parallel one-rank modules, and a novel Select-N-Fold strategy that folds untrained PLoRA modules into the pre-trained weights before local training, thereby accommodating heterogeneous client resources. We provide a unified analysis of initialization and aggregation noise of Fed-PLoRA and demonstrate how it addresses the limitations of state-of-the-art methods. Extensive experiments on diverse LLM fine-tuning tasks demonstrate that Fed-PLoRA consistently outperforms existing methods in both accuracy and efficiency. The code is available at https://github.com/TNI-playground/Fed-PLoRA.

2602.16936Feb 2026

View

Read-Modify-Writable Snapshots from Read/Write operations

In the context of asynchronous concurrent shared-memory systems, a snapshot algorithm allows failure-prone processes to concurrently and atomically write on the entries of a shared array MEM , and also atomically read the whole array. Recently, Read-Modify-Writable (RMWable) snapshot was proposed, a variant of snapshot that allows processes to perform operations more complex than just read and write, specifically, each entry MEM[k] is an arbitrary readable object. The known RMWable snapshot algorithms heavily rely on powerful low-level operations such as compare&swap or load-link/store-conditional to correctly produce snapshots of MEM. Following the large body of research devoted to understand the limits of what can be solved using the simple read/write low-level operations, which are known to be strictly weaker than compare&swap and load-link/store-conditional, we explore if RMWable snapshots are possible using only read/write operations. We present two read/write RMWable snapshot algorithms, the first one in the standard concurrent shared-memory model where the number of processes n is finite and known in advance, and the second one in a variant of the standard model with unbounded concurrency, where there are infinitely many processes, but at any moment only finitely many processes participate in an execution.

2602.16903Feb 2026

View

GDEV-AI: A Generalized Evaluation of Deep Learning Inference Scaling and Architectural Saturation

The deployment of deep learning inference in production environments continues to grow, where throughput, latency, and hardware efficiency are critical. Although specialized accelerators are increasingly adopted, many inference workloads still run on CPU-only systems, particularly in legacy data centers and cost-sensitive environments. This study investigates the scalability limits of CPU-based inference for convolutional neural networks by benchmarking ResNet models across varying batch sizes on two hardware tiers: a legacy Intel Xeon E5-2403 v2 processor and a modern Intel Xeon 6 "Granite Rapids" platform. Results show that legacy CPUs quickly reach throughput saturation, with limited scaling beyond small batch sizes due to instruction-level and memory constraints. In contrast, the Granite Rapids system leverages Intel Advanced Matrix Extensions (AMX) to achieve substantially higher throughput. However, oversubscription beyond physical core limits introduces execution contention and tail-latency amplification, revealing a performance degradation regime in modern architectures. We introduce GDEV-AI, a reproducible benchmarking framework for analyzing scalability behavior and architectural saturation in CPU-based inference. By establishing a vendor-neutral baseline, this work provides empirical insight into performance bottlenecks and informs capacity planning in heterogeneous data center environments.

2602.16858Feb 2026

View

Page 1 of 382

Systems & Networking Papers - ScienceStack

Trending in Systems & Networking

DataJoint 2.0: A Computational Substrate for Agentic Scientific Workflows

2602.16585

Feb 2026Databases

ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System

2602.13692

Feb 2026Operating Systems

Fine-Tuning GPT-5 for GPU Kernel Generation

2602.11000

Feb 2026Distributed, Parallel, and Cluster Computing

Benchmarking Large Language Models for Knowledge Graph Validation

2602.10748

Feb 2026Databases

vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models

2602.02204

Feb 2026Distributed, Parallel, and Cluster Computing

The Stretto Execution Engine for LLM-Augmented Data Systems

2602.04430

Feb 2026Databases

PROBE: Co-Balancing Computation and Communication in MoE Inference via Real-Time Predictive Prefetching

2602.00509

Jan 2026Distributed, Parallel, and Cluster Computing

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

2601.20309

Jan 2026Distributed, Parallel, and Cluster Computing

Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers

2601.19092

Jan 2026Distributed, Parallel, and Cluster Computing

MegaFlow: Large-Scale Distributed Orchestration System for the Agentic Era

2601.07526

Jan 2026Distributed, Parallel, and Cluster Computing

OrchestrRL: Dynamic Compute and Network Orchestration for Disaggregated RL

2601.01209

Jan 2026Distributed, Parallel, and Cluster Computing

MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability

2601.00481

Jan 2026Networking and Internet Architecture

RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure

2512.225601

Dec 2025Distributed, Parallel, and Cluster Computing

ECCO: Leveraging Cross-Camera Correlations for Efficient Live Video Continuous Learning

2512.11727

Dec 2025Distributed, Parallel, and Cluster Computing

RDMA Point-to-Point Communication for LLM Systems

2510.276562

Oct 2025Distributed, Parallel, and Cluster Computing

7,635 papers

EDRP: Enhanced Dynamic Relay Point Protocol for Data Dissemination in Multi-hop Wireless IoT Networks

2602.17619Feb 2026

View

2602.17610

Exploring Novel Data Storage Approaches for Large-Scale Numerical Weather Prediction

2602.17610Feb 2026

View

TopoSZp: Lightweight Topology-Aware Error-controlled Compression for Scientific Data

2602.17552Feb 2026

View

2602.17541

Informative Trains: A Memory-Efficient Journey to a Self-Stabilizing Leader Election Algorithm in Anonymous Graphs

2602.17541Feb 2026

View

HAP Networks for the Future: Applications in Sensing, Computing, and Communication

2602.17534Feb 2026

View

ACOS: Arrays of Cheap Optical Switches

2602.17449Feb 2026

View

Voice-Driven Semantic Perception for UAV-Assisted Emergency Networks

2602.17394Feb 2026

View

End-to-End Latency Measurement Methodology for Connected and Autonomous Vehicle Teleoperation

2602.17381Feb 2026

View

Do GPUs Really Need New Tabular File Formats?

2602.17335Feb 2026

View

Evaluating Malleable Job Scheduling in HPC Clusters using Real-World Workloads

2602.17318Feb 2026

View

Visual Insights into Agentic Optimization of Pervasive Stream Processing Services

2602.17282Feb 2026

View

Trivance: Latency-Optimal AllReduce by Shortcutting Multiport Networks

2602.17254Feb 2026

View

Hierarchical Edge-Cloud Task Offloading in NTN for Remote Healthcare

2602.17209Feb 2026

View

RIS Control through the Lens of Stochastic Network Calculus: An O-RAN Framework for Delay-Sensitive 6G Applications

2602.17198Feb 2026

View

Multiple Index Merge for Approximate Nearest Neighbor Search

2602.17099Feb 2026

View

Robust and Extensible Measurement of Broadband Plans with BQT+

2602.16969Feb 2026

View

HyRA: A Hybrid Resource Allocation Framework for RAN Slicing

2602.16952Feb 2026

View

Heterogeneous Federated Fine-Tuning with Parallel One-Rank Adaptation

2602.16936Feb 2026

View

Read-Modify-Writable Snapshots from Read/Write operations

2602.16903Feb 2026

View

GDEV-AI: A Generalized Evaluation of Deep Learning Inference Scaling and Architectural Saturation

2602.16858Feb 2026

View

Page 1 of 382