
Adaptive Semantic Token Selection for AI-native Goal-oriented Communications

Alessio Devoto, Simone Petruzzi, Jary Pomponi, Paolo Di Lorenzo, Simone Scardapane

TL;DR

This work addresses dynamic bandwidth and computation constraints in AI-native goal-oriented communications by combining a transformer-based deep JSCC pipeline with a trainable, per-input token selection mechanism. A budget token, threshold gates, and per-layer halting scores enable adaptive token dropping under a budget $\alpha \in [0,1]$, optimized via a penalty on $(T(x) - \alpha T)^2$ and trained across random budgets. The approach yields a single model that maintains high task accuracy across a range of latency and bandwidth constraints and provides interpretable token-discard masks that reflect semantic content. Empirical results on Imagenette with a DeiT backbone show robust performance under noisy channels and clear interpretability of the token-selection process, highlighting practical benefits for flexible AI-native communication systems.
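To make the budget penalty concrete, the following minimal Python sketch evaluates $(T(x) - \alpha T)^2$ for a given token count. It assumes that $T(x)$ is the number of tokens kept for input $x$ and $T$ is the total number of tokens; the function names `budget_penalty` and `sample_budget`, the uniform sampling of $\alpha$, and the value `T = 196` are illustrative assumptions, not details taken from the paper.

```python
import random


def budget_penalty(kept_tokens: float, total_tokens: int, alpha: float) -> float:
    """Squared gap between the tokens kept, T(x), and the target alpha * T."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (kept_tokens - alpha * total_tokens) ** 2


def sample_budget(rng: random.Random) -> float:
    """Draw a training budget. Uniform sampling on [0, 1] is an assumption."""
    return rng.uniform(0.0, 1.0)


T = 196  # illustrative token count, not a value stated in the source

# Keeping exactly half of the tokens under alpha = 0.5 incurs no penalty.
on_target = budget_penalty(98, T, 0.5)

# Keeping 150 tokens overshoots the target of 98 by 52 tokens.
over_budget = budget_penalty(150, T, 0.5)

print(on_target, over_budget)  # 0.0 2704.0
```

In an actual training loop this term would be added to the task loss with some weighting, and the kept-token count would need a differentiable surrogate; both details are outside what this summary specifies.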

Abstract

In this paper, we propose a novel design for AI-native goal-oriented communications, exploiting transformer neural networks under dynamic inference constraints on bandwidth and computation. Transformers have become the standard architecture for pretraining large-scale vision and text models, and preliminary results have shown promising performance also in deep joint source-channel coding (JSCC). Here, we consider a dynamic model where communication happens over a channel with variable latency and bandwidth constraints. Leveraging recent works on conditional computation, we exploit the structure of the transformer blocks and the multihead attention operator to design a trainable semantic token selection mechanism that learns to select relevant tokens (e.g., image patches) from the input signal. This is done dynamically, on a per-input basis, with a rate that can be chosen as an additional input by the user. We show that our model improves over state-of-the-art token selection mechanisms, exhibiting high accuracy for a wide range of latency and bandwidth constraints, without the need for deploying multiple architectures tailored to each constraint. Last, but not least, the proposed token selection mechanism helps extract powerful semantics that are easy to understand and explain, paving the way for interpretable-by-design models for the next generation of AI-native communication systems.

Paper Structure (9 sections, 8 equations, 8 figures)

Figures (8)

  • Figure 1: Overview of the proposed method. Each transformer block is preceded by a trainable token selection block, which leverages a user-provided runtime budget to discard tokens. Different budgets result in different model behaviours. During training, the budget is randomly selected; at inference, it is chosen by the user.
  • Figure 2: Detail of the token selection module. The module learns a threshold $\gamma_k$ from the budget token and a halting score $s_i$ for other tokens. Tokens are discarded if their score is lower than the learned threshold.
  • Figure 3: Accuracy-efficiency trade-off (in FLOPs) for the proposed method compared to other baselines, in the noiseless setup. Unlike other methods, which require fine-tuning for each budget (separate markers), our method delivers a single model for all possible budgets, represented as a continuous line.
  • Figure 4: Accuracy of the proposed method when trained with the local penalty. We impose a budget constraint on the number of discarded tokens (bandwidth), represented on the x-axis. We do not show the baselines here, as they do not allow for adaptive bandwidth selection.
  • Figure 5: Accuracy of the proposed method across different SNRs. Each line represents a budget, imposed at inference via the budget token. If the current channel condition is known, selecting the right budget can save resources.
  • ...and 3 more figures
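
The thresholding rule in the Figure 2 caption (a token is discarded if its halting score $s_i$ is lower than the threshold $\gamma_k$) can be sketched in a few lines of Python. In the paper both quantities are learned; here the scores and the threshold are hand-picked numbers, and the function name `select_tokens` is a hypothetical helper used only for illustration.

```python
from typing import List


def select_tokens(halting_scores: List[float], threshold: float) -> List[bool]:
    """Return a keep-mask: token i is discarded when its score is below the threshold."""
    return [score >= threshold for score in halting_scores]


# Hand-picked values standing in for the learned scores s_i and threshold gamma_k.
scores = [0.9, 0.2, 0.55, 0.1, 0.7]
gamma_k = 0.5

mask = select_tokens(scores, gamma_k)
kept_indices = [i for i, keep in enumerate(mask) if keep]

print(mask)          # [True, False, True, False, True]
print(kept_indices)  # [0, 2, 4]
```

A hard comparison like this is not differentiable, so a trainable version would need a relaxation during training; how the authors handle that is not stated in this summary.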