OCTOPINF: Workload-Aware Inference Serving for Edge Video Analytics

Thanh-Tung Nguyen; Lucas Liebe; Nhat-Quang Tau; Yuheng Wu; Jinghan Cheng; Dongman Lee

TL;DR

OCTOPINF tackles the core challenges of edge video analytics (EVA) by jointly optimizing dynamic batching, cross-device workload distribution, and co-location-aware GPU scheduling. It decomposes the edge-inference problem into tractable subproblems handled by a Cross-device Workload Distributor (Cwd) and a Co-location Inference Spatiotemporal Scheduler (Coral), unified under a runtime AutoScaler. On a real-world Edge-GPU testbed, the approach yields substantial gains in effective throughput and keeps latency tight under fluctuating workloads and network conditions. Its flexible architecture supports heterogeneous inference platforms and scalable deployment, which makes it practical for real-time EVA at the edge. The work contributes a concrete, near-real-time optimization framework for edge inference that adapts to changing content, bandwidth, and hardware configurations.

Abstract

Edge Video Analytics (EVA) has gained significant attention as a major application of pervasive computing, enabling real-time visual processing. EVA pipelines, composed of deep neural networks (DNNs), typically demand efficient inference serving under stringent latency requirements, which is challenging in dynamic edge environments (e.g., workload variability and network instability). EVA pipelines also face significant contention for constrained edge resources (e.g., GPUs). In this paper, we introduce OCTOPINF, a novel resource-efficient and workload-aware inference serving system designed for real-time EVA. OCTOPINF tackles the unique challenges of dynamic edge environments through fine-grained resource allocation, adaptive batching, and workload balancing between edge devices and servers. Furthermore, we propose a spatiotemporal scheduling algorithm that optimizes the co-location of inference tasks on GPUs, improving performance and ensuring service-level objective (SLO) compliance. Extensive evaluations on a real-world testbed demonstrate the effectiveness of our approach: it increases effective throughput by up to 10x compared to the baselines and is more robust in challenging scenarios. OCTOPINF can be used for any DNN-based EVA inference task with minimal adaptation and is available at https://github.com/tungngreen/PipelineScheduler.
