Distributed Collaborative Inference System in Next-Generation Networks and Communication

Chuan Zhang, Xixi Zheng, Xiaolong Tao, Chenfei Hu, Weiting Zhang, Liehuang Zhu

TL;DR

The paper addresses the challenge of running compute-intensive generative AI tasks on resource-constrained devices in 6G networks by proposing a cloud-edge-end, multi-level collaborative inference framework. It combines model deployment across three network levels, confidence-based offloading, attention-based input pruning, ensemble fusion, and a transformer-optimized early exit to reduce latency while preserving accuracy. Experiments on IMDB sentiment analysis using BERT variants show latency reductions of up to 17% with comparable accuracy, and provide guidance on hyperparameter settings for different application needs. The framework is designed to adapt to diverse devices, network conditions, and model architectures.

Abstract

With the rapid advancement of artificial intelligence, generative artificial intelligence (GAI) has taken a leading role in transforming data processing methods. However, the high computational demands of GAI present challenges for devices with limited resources. As mobile networks move towards the sixth generation (6G), its higher data rates and improved energy efficiency call for correspondingly more efficient data processing in GAI, a demand that traditional GAI struggles to meet. To address these challenges, we introduce a multi-level collaborative inference system designed for next-generation networks and communication. Our proposed system features a deployment strategy that assigns models of varying sizes to devices at different network layers. We then design a task offloading strategy to optimise both efficiency and latency. Furthermore, a modified early exit mechanism is implemented to accelerate inference within a single model. Experimental results demonstrate that our system effectively reduces inference latency while maintaining high-quality output. Specifically, compared to existing work, our system can reduce inference time by up to 17% without sacrificing inference accuracy.
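
To make the offloading idea concrete, the following is a minimal sketch of confidence-based escalation across tiers, not the authors' implementation. It assumes a fixed confidence threshold (the paper's Figure 4 suggests a confidence-to-probability curve instead), and the names `cascade_infer`, `tiers`, and `threshold`, as well as the toy models, are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A "model" here is any callable mapping an input to a list of class logits.
Model = Callable[[object], Sequence[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_infer(
    x: object,
    tiers: Sequence[Tuple[str, Model]],
    threshold: float = 0.9,
) -> Tuple[str, int, float]:
    """Run tiers from smallest to largest; stop at the first confident one.

    Assumption: `tiers` is ordered end device -> edge -> cloud, and the
    top softmax probability is used as the confidence score.
    Returns (tier name, predicted class index, confidence).
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for index, (name, model) in enumerate(tiers):
        probs = softmax(model(x))
        confidence = max(probs)
        is_last = index == len(tiers) - 1
        # Accept the answer locally if confident; the last tier always answers.
        if confidence >= threshold or is_last:
            return name, probs.index(confidence), confidence
    raise AssertionError("unreachable: the last tier always returns")


# Toy stand-ins for a small, medium, and large model (hypothetical logits).
toy_tiers = [
    ("end", lambda x: [0.1, 0.2]),
    ("edge", lambda x: [0.0, 3.0]),
    ("cloud", lambda x: [5.0, 0.0]),
]
```

With the toy tiers above, the end-device model is not confident enough at a threshold of 0.9, so the request is escalated and answered at the edge; raising the threshold to 0.99 pushes it on to the cloud. A lower threshold trades accuracy for latency, which is the kind of hyperparameter choice the paper reports guidance on.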

Paper Structure

This paper contains 27 sections, 14 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Examples of services provided by multi-level collaborative inference in mobile networks.
  • Figure 2: Inference workflow of our system.
  • Figure 3: Acceleration of single-model inference.
  • Figure 4: Confidence vs. offloading probability curve.
  • Figure 5: Simple structure of the transformer encoder.
  • ...and 4 more figures