Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication

Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis

TL;DR

This paper tackles the challenge of deploying transformer-based semantic communication on resource-constrained edge devices by introducing a training-free adaptive token merging mechanism for Vision Transformers. By adjusting the number of tokens merged at each layer and formulating that choice as a multi-objective optimization of accuracy versus computation, the authors construct Pareto fronts via Gaussian process-based Bayesian optimization, which enables runtime adaptation without retraining. Empirical results on ImageNet with a ViT backbone show significant reductions in FLOPs while preserving task performance across a range of SNR conditions, and demonstrate that adaptive policies can respond to channel quality to balance throughput and fidelity. The work provides a practical, plug-and-play approach for scalable edge intelligence in 6G, with potential extensions to multimodal transformers and energy-aware objectives.

Abstract

Large-scale transformer models have emerged as a powerful tool for semantic communication systems, enabling edge devices to extract rich representations for robust inference across noisy wireless channels. However, their substantial computational demands remain a major barrier to practical deployment in resource-constrained 6G networks. In this paper, we present a training-free framework for adaptive token merging in pretrained vision transformers to jointly reduce inference time and transmission resource usage. We formulate the selection of per-layer merging proportions as a multi-objective optimization problem to balance accuracy and computational cost. We employ Gaussian process-based Bayesian optimization to construct a Pareto frontier of optimal configurations, enabling flexible runtime adaptation to dynamic application requirements and channel conditions. Extensive experiments demonstrate that our method consistently outperforms other baselines and achieves significant reductions in floating-point operations while maintaining competitive accuracy across a wide range of signal-to-noise ratio (SNR) conditions. Additional results highlight the effectiveness of adaptive policies that adjust merging aggressiveness in response to channel quality, providing a practical mechanism to trade off latency and semantic fidelity on demand. These findings establish a scalable and efficient approach for deploying transformer-based semantic communication in future edge intelligence systems.
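To make the notion of a Pareto frontier over accuracy and computational cost concrete, the following is a minimal sketch of how non-dominated configurations could be filtered from a set of evaluated candidates. It is not the authors' implementation: the Bayesian optimization loop that proposes candidates is omitted, and the function name `pareto_front` and all numeric values are hypothetical placeholders chosen for illustration.

```python
from typing import List, Tuple

# One evaluated configuration: (top-1 accuracy, GFLOPs).
# Accuracy is to be maximized, GFLOPs minimized.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by increasing GFLOPs.

    A point is dominated if some other point is at least as accurate and
    at least as cheap, and strictly better in one of the two objectives.
    """
    front = []
    for i, (acc_i, cost_i) in enumerate(points):
        dominated = False
        for j, (acc_j, cost_j) in enumerate(points):
            if j == i:
                continue
            no_worse = acc_j >= acc_i and cost_j <= cost_i
            strictly_better = acc_j > acc_i or cost_j < cost_i
            if no_worse and strictly_better:
                dominated = True
                break
        if not dominated:
            front.append((acc_i, cost_i))
    return sorted(front, key=lambda p: p[1])


# Hypothetical (accuracy, GFLOPs) samples, NOT results from the paper.
samples = [
    (0.80, 4.0),
    (0.78, 3.0),
    (0.75, 3.5),  # dominated by (0.78, 3.0)
    (0.70, 2.0),
    (0.81, 4.6),
    (0.79, 4.2),  # dominated by (0.80, 4.0)
]

front = pareto_front(samples)
```

The quadratic scan is adequate for the few hundred candidates a Bayesian optimization run typically evaluates; a sort-based sweep would be preferable for much larger sets.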

Paper Structure

This paper contains 11 sections, 18 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the proposed edge-to-cloud semantic communication system. An input image is partitioned into patches and projected into embeddings, which are processed by a pretrained transformer encoder augmented with a training-free token merging module. The resulting semantic tokens are compressed by a JSCC encoder and transmitted over a noisy wireless channel. The server reconstructs the tokens using a JSCC decoder and performs task-specific inference.
  • Figure 2: Accuracy vs. GFLOPs trade-off across all evaluated configurations. Gray dots indicate all sampled configurations during Bayesian optimization. Red circles denote Pareto-optimal configurations discovered by our method.
  • Figure 3: Top-1 accuracy across varying SNR levels for the highest-accuracy Pareto configuration compared to the uncompressed model.
  • Figure 4: GFLOPs and accuracy versus SNR for the different merging strategies.
  • Figure 5: Left: Original image. Center: High-accuracy configuration sampled from the Pareto front. Right: Lower-complexity configuration from the Pareto front showing more aggressive merging.
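
The runtime adaptation suggested by Figures 4 and 5 can be sketched as a lookup over a precomputed Pareto front: the measured SNR is mapped to a compute budget, and the most accurate configuration within that budget is selected. This is only an illustration under stated assumptions. The names `MergeConfig`, `budget_for_snr`, and `select_config`, the SNR thresholds, the direction of the SNR-to-budget mapping, and every numeric value are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MergeConfig:
    accuracy: float            # top-1 accuracy measured offline (placeholder)
    gflops: float              # encoder cost under this configuration
    ratios: Tuple[float, ...]  # per-layer merging proportions (placeholder)


def budget_for_snr(snr_db: float,
                   table: Sequence[Tuple[float, float]]) -> float:
    """Step-function lookup from SNR (dB) to a GFLOPs budget.

    `table` holds (min_snr_db, gflops_budget) rows sorted by ascending
    threshold. The budget of the last row whose threshold is met is used;
    below every threshold, the first row's budget applies.
    """
    budget = table[0][1]
    for min_snr_db, row_budget in table:
        if snr_db >= min_snr_db:
            budget = row_budget
    return budget


def select_config(front: Sequence[MergeConfig],
                  gflops_budget: float) -> MergeConfig:
    """Pick the most accurate Pareto configuration within the budget.

    If nothing fits, fall back to the cheapest available configuration.
    """
    feasible = [c for c in front if c.gflops <= gflops_budget]
    if not feasible:
        return min(front, key=lambda c: c.gflops)
    return max(feasible, key=lambda c: c.accuracy)


# Hypothetical Pareto front and policy table (illustrative values only).
front = [
    MergeConfig(accuracy=0.70, gflops=2.0, ratios=(0.30, 0.30, 0.30)),
    MergeConfig(accuracy=0.78, gflops=3.0, ratios=(0.20, 0.20, 0.10)),
    MergeConfig(accuracy=0.80, gflops=4.0, ratios=(0.10, 0.05, 0.00)),
]
policy_table = [(0.0, 2.5), (10.0, 3.5), (20.0, 4.5)]

chosen = select_config(front, budget_for_snr(12.0, policy_table))
```

Because the front is computed offline, switching configurations at runtime amounts to a table lookup and requires no retraining, which matches the training-free framing of the abstract. Whether a given SNR should map to more or less aggressive merging is a policy decision that this sketch leaves to the table.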