Quantization-Aware Collaborative Inference for Large Embodied AI Models

Zhonghao Lyu; Ming Xiao; Mikael Skoglund; Merouane Debbah; H. Vincent Poor

TL;DR

This work develops a tractable approximation for quantization-induced inference distortion and formulates a joint quantization bit-width and computation frequency design problem under delay and energy constraints. The design minimizes the distortion upper bound while ensuring tightness through the corresponding lower bound.

Abstract

Large artificial intelligence models (LAIMs) are increasingly regarded as a core intelligence engine for embodied AI applications. However, the massive parameter scale and computational demands of LAIMs pose significant challenges for resource-limited embodied agents. To address this issue, we investigate quantization-aware collaborative inference (co-inference) for embodied AI systems. First, we develop a tractable approximation for quantization-induced inference distortion. Based on this approximation, we derive lower and upper bounds on the quantization rate-inference distortion function, characterizing its dependence on LAIM statistics, including the quantization bit-width. Next, we formulate a joint quantization bit-width and computation frequency design problem under delay and energy constraints, aiming to minimize the distortion upper bound while ensuring tightness through the corresponding lower bound. Extensive evaluations validate the proposed distortion approximation and the derived rate-distortion bounds. In particular, simulations and real-world testbed experiments show that the proposed joint design effectively balances inference quality, latency, and energy consumption in edge embodied AI systems.
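To make the link between bit-width and distortion concrete, the following is a minimal sketch, not the paper's method. It applies a symmetric uniform quantizer to synthetic weights and reports the parameter mean-squared error at several bit-widths. The Laplace-like weight distribution, the function names, and the use of parameter MSE (rather than the paper's inference output distortion) are all assumptions made for illustration.

```python
import random


def uniform_quantize(values, bits):
    """Symmetric uniform quantizer with 2**bits levels on [-max|v|, +max|v|]."""
    if bits < 1:
        raise ValueError("bits must be >= 1")
    v_max = max(abs(v) for v in values)
    if v_max == 0.0:
        return list(values)  # nothing to quantize
    levels = 2 ** bits
    step = 2.0 * v_max / (levels - 1)  # spacing between reconstruction points
    # Map each value to the index of the nearest level, then back to a value.
    return [round((v + v_max) / step) * step - v_max for v in values]


def mse(a, b):
    """Mean-squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distortion_by_bitwidth(values, bit_widths):
    """Parameter MSE after quantization, for each candidate bit-width."""
    return {b: mse(values, uniform_quantize(values, b)) for b in bit_widths}


rng = random.Random(0)
# Assumption: Laplace-like synthetic weights stand in for pre-trained parameters.
weights = [rng.expovariate(20.0) * rng.choice((-1.0, 1.0)) for _ in range(5000)]

distortion = distortion_by_bitwidth(weights, [2, 4, 6, 8])
for b, d in distortion.items():
    print(f"{b}-bit: parameter MSE = {d:.3e}")
```

In this toy setting the error shrinks as the bit-width grows, which is the trade-off the joint design weighs against delay and energy: more bits reduce distortion but increase the data that must be stored, computed on, and transmitted.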

Paper Structure

This paper contains 25 sections, 5 theorems, 48 equations, 9 figures, 1 table, 1 algorithm.

Introduction
Research Background and Related Work
Motivations and Contributions
System Model
On-agent Inference and Embedding Transmission
On-server Inference and Result Feedback
LAIM Parameter Distribution Modeling
Inference Delay and Energy Consumption
Delay Analysis
Energy Consumption Analysis
Model Output Distortion Approximation
Rate-distortion Analysis for Quantization
Preliminaries of Rate-distortion Analysis
A Lower Bound on the Rate-distortion Function for Quantization
An Upper Bound on the Rate-distortion Function for Quantization
...and 10 more sections

Key Result

Proposition 3.1

Under Assumptions 1-3, an upper bound on the inference output distortion of an FC DNN induced by model quantization is given by where

Figures (9)

Figure 1: The considered LAIM-enabled embodied AI system.
Figure 2: Distribution of the parameter magnitudes of various pre-trained models.
Figure 3: Distortions of model outputs and parameters w.r.t. bit-width.
Figure 4: Illustration of the upper and lower bounds of the distortion-rate function.
Figure 5: Performance of BLIP-2 on MS-COCO w.r.t. different delay and energy consumption thresholds under uniform quantization with $E_0 = 2.00~\mathrm{J}$ (left) and $T_0 = 3.50~\mathrm{s}$ (right).
...and 4 more figures

Theorems & Definitions (10)

Proposition 3.1
Remark 3.1: Distortion metric
Remark 3.2: Extension to General AI Models
Definition 4.1: Rate-distortion function
Lemma 4.1
Lemma 4.2
Proposition 4.1
Remark 4.1
Proposition 4.2
Remark 4.2
