6G EdgeAI: Performance Evaluation and Analysis
Chien-Sheng Yang, Yu-Jen Ku, Yuan-Yao Lou, Nathan Tenny, Alex C.-C. Hsu
TL;DR
This paper tackles the latency challenges of GenAI workloads in 6G by proposing Integrated Communication and Computing (ICC), a framework that colocates computing with the RAN and jointly optimizes communication and computation. A queueing-theoretic analysis shows that ICC achieves $98\%$ higher service capacity than 5G MEC. System-level simulations with transformer-based LLM workloads show a $60\%$ improvement in service capacity and $27\%$ lower compute costs, especially under a priority-based joint latency management strategy. The analysis models the communication and computing stages as a tandem of $M/M/1$ queues under FCFS discipline, with stage sojourn times treated as independent, and the simulations use realistic GPU configurations and LLM inference models. The results indicate that ICC is a practical and scalable path to delivering real-time GenAI services at the 6G network edge, with potential applicability to other latency-sensitive applications and further gains from system-wide offline and online offloading optimizations.
Abstract
Generative AI (GenAI) services powered by large language models (LLMs) increasingly deliver real-time interactions, yet existing 5G multi-access edge computing (MEC) architectures often treat communication and computing as separate domains, limiting their ability to meet stringent latency requirements. To address this challenge, we introduce an Integrated Communication and Computing (ICC) framework in which computing capabilities reside directly in radio access network (RAN) nodes and bandwidth and computing resources are managed jointly. Our queueing-theoretic analysis shows that ICC outperforms 5G MEC, achieving 98% higher service capacity, defined as the maximum arrival rate at which a specified fraction of jobs still completes within a given delay budget. We corroborate these gains through system-level simulations that account for transformer-based LLM workloads, realistic GPU specifications, and a priority-based scheduling scheme. The simulations show that ICC improves service capacity by 60%, demonstrating its potential to enable efficient, cost-effective real-time GenAI services in 6G.
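The service-capacity definition above can be made concrete with a minimal sketch. It assumes the tandem $M/M/1$ setting with FCFS discipline and independent stage sojourn times, so each stage's sojourn time is exponential with rate $\mu - \lambda$ (a standard $M/M/1$ result) and the end-to-end delay is the sum of two independent exponentials. The function names, the bisection search, and all numeric values are illustrative assumptions, not the paper's implementation or parameters.

```python
import math


def on_time_fraction(lam, mu_comm, mu_comp, budget):
    """Fraction of jobs whose end-to-end delay is within `budget`.

    Assumes two M/M/1 FCFS stages in tandem with independent sojourn
    times; stage i then has an Exp(mu_i - lam) sojourn time.
    """
    a = mu_comm - lam  # rate of the communication-stage sojourn time
    b = mu_comp - lam  # rate of the computing-stage sojourn time
    if a <= 0 or b <= 0:
        return 0.0  # unstable queue: delays grow without bound
    if math.isclose(a, b, rel_tol=1e-9):
        # Equal rates: the sum is Erlang-2 distributed.
        return 1.0 - math.exp(-a * budget) * (1.0 + a * budget)
    # Unequal rates: hypoexponential CDF.
    tail = (b * math.exp(-a * budget) - a * math.exp(-b * budget)) / (b - a)
    return 1.0 - tail


def service_capacity(mu_comm, mu_comp, budget, target=0.95, tol=1e-9):
    """Largest arrival rate keeping on_time_fraction >= target.

    Uses bisection, which is valid because the on-time fraction
    decreases monotonically as the arrival rate grows.
    """
    lo, hi = 0.0, min(mu_comm, mu_comp)
    if on_time_fraction(lo, mu_comm, mu_comp, budget) < target:
        return 0.0  # target is unreachable even with no load
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if on_time_fraction(mid, mu_comm, mu_comp, budget) >= target:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical rates in jobs/s and a hypothetical 200 ms delay budget.
capacity = service_capacity(mu_comm=100.0, mu_comp=50.0, budget=0.2)
print(f"service capacity: {capacity:.2f} jobs/s")
```

Comparing two architectures under this sketch amounts to calling `service_capacity` with each architecture's service rates and delay budget and taking the ratio. The figures reported above come from the authors' own model and simulations, which this toy calculation does not reproduce.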
