Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference

Cheng Yuan, Zhening Liu, Jiashu Lv, Jiawei Shao, Yufei Jiang, Jun Zhang, Xuelong Li

TL;DR

A task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework, where visual features are merged by clustering and encoded by a learnable and selective entropy model before feature projection, thereby minimizing both data transmission and computational complexity.

Abstract

With the rapid development of large multimodal models (LMMs), multimodal understanding applications are emerging. As most LMM inference requests originate from edge devices with limited computational capabilities, the predominant inference pipeline forwards the input data directly to an edge server, which handles all computations. However, this approach introduces high transmission latency due to the limited uplink bandwidth of edge devices and significant computation latency caused by the prohibitive number of visual tokens, thus hindering delay-sensitive tasks and degrading user experience. To address this challenge, we propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework, where visual features are merged by clustering and encoded by a learnable and selective entropy model before feature projection. Specifically, we employ density peaks clustering based on K nearest neighbors to reduce the number of visual features, thereby minimizing both data transmission and computational complexity. Subsequently, a learnable entropy model with a hyperprior is utilized to encode and decode the merged features, further reducing transmission overhead. To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features, enabling a more accurate estimation of the probability distribution. Comprehensive experiments on seven visual question answering benchmarks validate the effectiveness of the proposed TOFC method. Results show that, compared with the neural compression method ELIC, TOFC achieves up to a 52% reduction in data transmission overhead and a 63% reduction in system latency while maintaining identical task performance.
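The feature-merging step named in the abstract (density peaks clustering based on K nearest neighbors) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `dpc_knn_merge`, the exponential density formula, the default `k`, and merging by simple averaging are all assumptions, chosen to follow the commonly used DPC-KNN formulation.

```python
import numpy as np

def dpc_knn_merge(features, num_clusters, k=5):
    """Merge N feature vectors into num_clusters vectors (hypothetical helper).

    features: (N, D) array with N > k. Returns (merged, labels).
    """
    # Pairwise Euclidean distances, shape (N, N).
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Local density from the k nearest neighbours; column 0 of each
    # sorted row is the zero self-distance, so it is skipped.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    density = np.exp(-(knn ** 2).mean(axis=1))

    # delta_i: distance to the closest point that has a higher density.
    higher = density[None, :] > density[:, None]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    # Points with no denser neighbour get the largest distance instead.
    delta[np.isinf(delta)] = dist.max()

    # Cluster centres are the points with the largest density * delta.
    centers = np.argsort(-(density * delta))[:num_clusters]

    # Assign every point to its nearest centre; keep centres in their own cluster.
    labels = dist[:, centers].argmin(axis=1)
    labels[centers] = np.arange(num_clusters)

    # Assumption: merged feature = plain average of the cluster members.
    merged = np.stack([features[labels == c].mean(axis=0)
                       for c in range(num_clusters)])
    return merged, labels
```

Calling `dpc_knn_merge(tokens, num_clusters=C)` on an `(N, D)` array returns `C` merged vectors, so both the payload to transmit and the number of tokens seen by the LLM shrink from `N` to `C`.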

Paper Structure

This paper contains 26 sections, 19 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: An example of the proposed device-edge co-inference system for multimodal understanding tasks, where the user device extracts and compresses visual features, and the intensive computation of LLM inference is handled by the edge server.
  • Figure 2: System diagram of the proposed feature compression method. The user device executes vision encoding, feature merging, and entropy encoding for the visual input, while the edge server performs entropy decoding, feature projection, and autoregressive generation of the LLM. The feature merging module reduces the number of visual features, which cuts both data transmission overhead and computational complexity and thus lowers the end-to-end system latency. The learnable and selective entropy model with a hyperprior is utilized to encode and decode the merged features, further reducing data transmission overhead.
  • Figure 3: Network architecture of the learnable entropy model. The hyperprior analysis network $h_{\rm{a}}$ extracts the hyperprior $\boldsymbol{z}$ from the merged features $\boldsymbol{y}$. The hyperprior synthesis network $h_{\rm s}$ estimates the mean and scale of the merged features from the quantized hyperprior $\bar{\boldsymbol{z}}$. FC represents a fully connected layer, where the two parameters in parentheses denote input and output dimensions, respectively. AE and AD represent arithmetic encoder and decoder, respectively. $\lfloor \cdot \rceil$ denotes rounding to the nearest integer.
  • Figure 4: The measured PDF of the visual features, in comparison to Gaussian, Laplacian, and Cauchy distributions with mean and scale obtained from maximum likelihood estimation.
  • Figure 5: Network architecture of the selective entropy model (w/ temp.: with temperature). The router network analyzes each merged feature $\boldsymbol{y}_{n,i}$ and generates a score vector $\boldsymbol{s}_{n,i}$, which is processed by a softmax function with temperature $T$ to compute the weighting coefficients for $n_{\rm e}$ entropy models. During training, the final quantized features $\bar{\boldsymbol{y}}_{n,i}$ are the weighted sum of the decoding results from each entropy model. During inference, with $T \rightarrow 0$, only the entropy model with the highest score is used for encoding and decoding.
  • ...and 7 more figures
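The routing behaviour described in the Figure 5 caption (softmax with temperature $T$ during training, hard selection as $T \rightarrow 0$ during inference) can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the array shapes, and the stand-in `decoded` array (one decoded vector per entropy model) are assumptions.

```python
import numpy as np

def softmax_with_temperature(scores, temperature):
    """Weights over the n_e entropy models for one score vector s."""
    z = scores / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def combine_training(scores, decoded, temperature):
    """Training: weighted sum of the decoding results of all entropy models.

    scores:  (n_e,) router scores for one merged feature.
    decoded: (n_e, D) decoded feature from each entropy model (stand-in data).
    """
    weights = softmax_with_temperature(scores, temperature)
    return weights @ decoded

def select_inference(scores):
    """Inference (T -> 0): index of the entropy model with the highest score."""
    return int(np.argmax(scores))
```

As the temperature shrinks, the softmax weights approach a one-hot vector, so the training-time weighted sum converges to the output of the single model chosen by `select_inference`. This is why only one entropy model needs to run at inference.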