AI Flow at the Network Edge
Jiawei Shao, Xuelong Li
TL;DR
AI Flow addresses the challenge of running large multimodal models at the network edge by introducing a holistic device-edge-cloud cooperative framework. It describes a system architecture comprising edge devices, edge servers, and cloud servers, together with three inference paradigms that adapt to available resources and network conditions. The paper offers theoretical insights grounded in the information bottleneck principle to justify cooperative inference and outlines enabling techniques such as split inference and speculative decoding. An image captioning case study complements these ideas, showing near cloud-level caption quality with reduced latency. The work demonstrates the practical viability of distributing intelligence across the network edge to enable low-latency, task-oriented communications, and it identifies security, co-design, and scalability as avenues for future work.
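Speculative decoding is only named above. The following minimal sketch shows how it can split generation between a device and a server. It assumes a small draft model on the device and a large target model on an edge or cloud server, both represented as hypothetical greedy next-token functions (`draft_model`, `target_model` are illustrative names, not from the paper). The draft proposes `k` tokens. The target checks them, keeps the longest agreeing prefix, and supplies its own token at the first disagreement, or one bonus token if all `k` agree. Under greedy decoding, the output matches what the target alone would produce, while the number of verification rounds can be much smaller than the number of tokens. This is the generic greedy variant, not necessarily the exact scheme used in AI Flow.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(
    prefix: List[int],
    draft_model: NextToken,   # small model, assumed to run on the device
    target_model: NextToken,  # large model, assumed to run on an edge/cloud server
    k: int,
    max_new_tokens: int,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (new tokens, verification rounds)."""
    tokens = list(prefix)
    generated, rounds = 0, 0
    while generated < max_new_tokens:
        rounds += 1
        # 1. Device drafts k tokens autoregressively with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Server verifies each drafted position with the target model.
        #    (A real system scores all k positions in one batched forward pass.)
        accepted, ctx = [], list(tokens)
        for t in draft:
            expected = target_model(ctx)
            if expected != t:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(t)
            ctx.append(t)
        else:
            # Every draft token was accepted: the target adds one bonus token.
            accepted.append(target_model(ctx))
        # Never emit more than max_new_tokens in total.
        accepted = accepted[: max_new_tokens - generated]
        tokens.extend(accepted)
        generated += len(accepted)
    return tokens[len(prefix):], rounds


if __name__ == "__main__":
    target = lambda ctx: (ctx[-1] + 1) % 10                          # toy "large" model
    draft = lambda ctx: 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10    # toy draft that errs after 2
    out, rounds = speculative_decode([0], draft, target, k=4, max_new_tokens=8)
    print(out, rounds)  # [1, 2, 3, 4, 5, 6, 7, 8] 2
```

In this toy run, the server produces 8 tokens in 2 verification rounds instead of 8 sequential calls. With sampled (non-greedy) decoding, the exact-match check would be replaced by a probabilistic acceptance rule.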
Abstract
Recent advancements in large language models (LLMs) and their multimodal variants have led to remarkable progress across various domains, demonstrating impressive capabilities and unprecedented potential. In the era of ubiquitous connectivity, leveraging communication networks to distribute intelligence is a transformative concept, envisioning AI-powered services accessible at the network edge. However, pushing large models from the cloud to resource-constrained environments faces critical challenges. Model inference on low-end devices leads to excessive latency and performance bottlenecks, while raw data transmission over limited-bandwidth networks causes high communication overhead. This article presents AI Flow, a framework that streamlines the inference process by jointly leveraging the heterogeneous resources available across devices, edge nodes, and cloud servers, making intelligence flow across networks. To facilitate cooperation among multiple computational nodes, the proposed framework explores a paradigm shift in the design of communication network systems from transmitting information flow to intelligence flow, where the goal of communications is task-oriented and folded into the inference process. Experimental results demonstrate the effectiveness of the proposed framework through an image captioning use case, showcasing its ability to reduce response latency while maintaining high-quality captions. This article serves as a position paper that identifies the motivation, challenges, and principles of AI Flow.
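The shift from transmitting raw data to transmitting task-relevant features can be illustrated with a toy split-inference sketch. This is not the paper's method. Average pooling stands in for learned on-device layers, uniform 8-bit quantization stands in for feature compression, and a thresholded linear score stands in for the server-side model. All function names (`device_head`, `server_tail`, `split_inference`) are hypothetical. The point is only to compare how many bytes cross the link when the device ships compact features instead of raw float32 input.

```python
import struct
from typing import List, Tuple


def device_head(x: List[float], pool: int) -> List[float]:
    """Hypothetical on-device layers: average-pool the input into compact features."""
    return [sum(x[i:i + pool]) / pool for i in range(0, len(x), pool)]


def quantize(features: List[float], lo: float, hi: float) -> bytes:
    """Uniform 8-bit quantization, assuming features lie in [lo, hi] with hi > lo."""
    scale = 255 / (hi - lo)
    return bytes(min(255, max(0, round((f - lo) * scale))) for f in features)


def dequantize(payload: bytes, lo: float, hi: float) -> List[float]:
    """Server-side inverse of quantize (lossy)."""
    scale = (hi - lo) / 255
    return [lo + b * scale for b in payload]


def server_tail(features: List[float], weights: List[float], bias: float) -> int:
    """Hypothetical server-side layers: a linear score thresholded into a binary label."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return int(score > 0)


def split_inference(
    x: List[float], pool: int, weights: List[float], bias: float,
    lo: float = 0.0, hi: float = 1.0,
) -> Tuple[int, int, int]:
    """Returns (label, bytes for raw float32 upload, bytes actually transmitted)."""
    raw_bytes = len(struct.pack(f"{len(x)}f", *x))    # cost of shipping raw input
    payload = quantize(device_head(x, pool), lo, hi)  # what goes over the link
    label = server_tail(dequantize(payload, lo, hi), weights, bias)
    return label, raw_bytes, len(payload)


if __name__ == "__main__":
    x = [i / 1023 for i in range(1024)]  # toy "raw" input with values in [0, 1]
    label, raw, sent = split_inference(x, pool=16, weights=[1.0] * 64, bias=-30.0)
    print(label, raw, sent)  # 1 4096 64
```

Here the device uploads 64 bytes instead of 4096, a 64x reduction, and the server still reaches a decision. Where to split a real model, and how aggressively to compress, trades on-device compute against bandwidth and task accuracy.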
