Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding

Jingxuan Men; Mahdi Boloursaz Mashhadi; Ning Wang; Yi Ma; Mike Nilsson; Rahim Tafazolli

TL;DR

This paper proposes Video TokenCom, a framework for textual intent-guided multi-rate video communication. It combines Unequal Error Protection (UEP)-based source-channel coding adaptation with a semantic-aware multi-rate bit-allocation strategy, in which tokens highly related to the user intent are encoded at full codebook precision.

Abstract

Token Communication (TokenCom) is a new paradigm, motivated by the recent success of Large AI Models (LAMs) and Multimodal Large Language Models (MLLMs), in which tokens serve as unified units of communication and computation, enabling efficient semantic- and goal-oriented information exchange in future wireless networks. In this paper, we propose a novel Video TokenCom framework for textual intent-guided multi-rate video communication with Unequal Error Protection (UEP)-based source-channel coding adaptation. The proposed framework integrates user-intended textual descriptions with discrete video tokenization and unequal error protection to enhance semantic fidelity under tight bandwidth constraints. First, discrete video tokens are extracted by a pretrained video tokenizer, while text-conditioned vision-language modeling and optical-flow propagation are jointly used to identify the tokens that correspond to user-intended semantics across space and time. Next, we introduce a semantic-aware multi-rate bit-allocation strategy, in which tokens highly related to the user intent are encoded at full codebook precision, whereas non-intended tokens are represented through reduced-codebook-precision differential encoding, saving rate while preserving semantic quality. Finally, a source and channel coding adaptation scheme is developed to adapt bit allocation and channel coding to varying resources and link conditions. Experiments on various video datasets demonstrate that the proposed framework outperforms both conventional and semantic communication baselines in perceptual and semantic quality over a wide SNR range.
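The rate effect of the two-class allocation described above can be illustrated with a small sketch. It is not the paper's implementation: it assumes that every intended token costs a fixed 16 bits and every non-intended token a fixed 11 bits (the values quoted in the Figure 3 caption), and it ignores any side information or variable-length cost of the differential encoding. The function name `average_bits_per_token` and the example mask are hypothetical.

```python
from typing import Sequence

# Assumed per-token costs, taken from the Figure 3 caption.
FULL_BITS = 16     # intended tokens, full codebook precision
REDUCED_BITS = 11  # non-intended tokens, reduced codebook precision


def average_bits_per_token(intended_mask: Sequence[bool],
                           full_bits: int = FULL_BITS,
                           reduced_bits: int = REDUCED_BITS) -> float:
    """Mean source bits per token for a two-class (intended / non-intended) allocation."""
    if len(intended_mask) == 0:
        raise ValueError("intended_mask must contain at least one token")
    n_intended = sum(bool(m) for m in intended_mask)
    n_other = len(intended_mask) - n_intended
    total_bits = n_intended * full_bits + n_other * reduced_bits
    return total_bits / len(intended_mask)


# Hypothetical example: 4 of 16 tokens are flagged as user-intended.
mask = [True] * 4 + [False] * 12
avg = average_bits_per_token(mask)
saving = 1 - avg / FULL_BITS  # relative saving vs. sending every token at full precision
print(f"average bits/token: {avg}")  # 12.25
print(f"saving vs. all-full-precision: {saving:.1%}")
```

Under these assumptions the average cost moves linearly between 11 and 16 bits per token as the intended ratio grows, which is why a narrow textual intent (few intended tokens) yields the largest rate saving.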

Paper Structure (25 sections, 48 equations, 7 figures, 3 tables, 3 algorithms)

Introduction
System Model
Video Tokenization
Multi-modal User-intended Token Extraction
Text-conditioned Heatmap Generation
Dynamic Optical Flow Propagation
Content Discrete Token Mapping
Semantic-aware Multi-rate Bit Coding
Full Codebook Precision Encoding of Intended Tokens
Reduced Codebook Precision Differential Encoding of Non-Intended Tokens
Token-level intended ratio and BPP analysis
UEP-based Joint Distortion and Transmission Delay Minimization
UEP candidate sets
PDU aggregation and per-candidate metrics
Class-level decision variables
...and 10 more sections

Figures (7)

Figure 1: Overall architecture of the proposed Video Token Communications framework with Multi-rate Textual Intent Source-Channel Coding Adaptation. The pipeline consists of: (1) proposed token-based textual intent-guided source encoder with multimodal user-intended token extraction and semantic-aware multi-rate bit coding; (2) proposed token-based source decoder; and (3) proposed UEP-based source channel coding/decoding adaptation.
Figure 2: Textual intent guided multiclass multirate token mapping.
Figure 3: Performance of the proposed video TokenCom framework with different textual intents: "The woman is hitting the man's mobile phone." and "Sky.". The bit-precision for transmitting user intended and non-intended tokens is 16 and 11 bits per token, respectively. The red rectangle shows the user-intended regions guided by the textual intent.
Figure 4: Performance of the proposed video TokenCom framework. The red rectangle shows the user-intended regions guided by the textual intent "Car and person."
Figure 5: Comparison with the benchmarks on the MCL-JCV (a) and UVG (b) datasets. Note that some H.265 points are not shown at low SNRs because the adaptive H.265 pipeline frequently "Failed" to decode more than 85% of the frames; such cases are marked as invalid in our evaluation.
...and 2 more figures
