Embodied AI-Enhanced Vehicular Networks: An Integrated Large Language Models and Reinforcement Learning Method

Ruichen Zhang, Changyuan Zhao, Hongyang Du, Dusit Niyato, Jiacheng Wang, Suttinee Sawadsitang, Xuemin Shen, Dong In Kim

TL;DR

This work tackles the joint optimization of data transmission and decision-making in embodied AI vehicular networks under bandwidth constraints. It couples LLAVA-based semantic extraction, which compresses multimodal sensor data into actionable text, with a GAE-PPO-driven reinforcement learning framework that adapts transmission policies in real time, guided by a Weber-Fechner QoE metric. The key contributions are a QoE-aware optimization problem, an LLAVA-based semantic pipeline with attention-grounded extraction, and a stable GAE-PPO solver with a detailed MDP for V2I/V2V resource management. Empirical results show QoE gains of up to 36% over DDPG, up to 47% fewer steps to convergence than pure PPO, and a QoE improvement of up to 61.4% when scaling from 4 to 8 vehicles, supporting the approach's effectiveness and scalability for future 6G IoV deployments.

Abstract

This paper investigates adaptive transmission strategies in embodied AI-enhanced vehicular networks by integrating large language models (LLMs) for semantic information extraction and deep reinforcement learning (DRL) for decision-making. The proposed framework aims to optimize both data transmission efficiency and decision accuracy by formulating an optimization problem that incorporates the Weber-Fechner law, which serves as a metric for balancing bandwidth utilization and quality of experience (QoE). Specifically, we employ the large language and vision assistant (LLAVA) model to extract critical semantic information from raw image data captured by embodied AI agents (i.e., vehicles), reducing transmission data size by more than 90% while retaining essential content for vehicular communication and decision-making. In the dynamic vehicular environment, we employ a generalized advantage estimation-based proximal policy optimization (GAE-PPO) method to stabilize decision-making under uncertainty. Simulation results show that attention maps from LLAVA highlight the model's focus on relevant image regions, enhancing semantic representation accuracy. Additionally, our proposed transmission strategy improves QoE by up to 36% compared to DDPG and accelerates convergence by reducing required steps by up to 47% compared to pure PPO. Further analysis indicates that adapting semantic symbol length provides an effective trade-off between transmission quality and bandwidth, achieving up to a 61.4% improvement in QoE when scaling from 4 to 8 vehicles.
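
As a rough illustration of how a Weber-Fechner-style QoE metric behaves, the sketch below uses the textbook form of the law, in which perceived quality grows with the logarithm of the stimulus. The function name `weber_fechner_qoe`, the constants `k` and `s0`, and the choice of semantic symbol length as the stimulus are assumptions made for illustration; the paper's own QoE formulation is not reproduced in this summary.

```python
import math


def weber_fechner_qoe(stimulus: float, k: float = 1.0, s0: float = 1.0) -> float:
    """Textbook Weber-Fechner law: perception = k * ln(stimulus / s0).

    stimulus: physical quantity being perceived (hypothetically, the
              semantic symbol length allocated to a vehicle).
    k:        assumed sensitivity constant.
    s0:       assumed threshold below which nothing is perceived.
    """
    if stimulus <= 0 or s0 <= 0:
        raise ValueError("stimulus and s0 must be positive")
    return k * math.log(stimulus / s0)


# Each doubling of the stimulus adds the same increment k * ln(2),
# so the perceived gain per extra unit of resource keeps shrinking.
for length in (4, 8, 16, 32):
    print(length, round(weber_fechner_qoe(length), 3))
```

The logarithmic shape is what makes such a metric useful for trading bandwidth against perceived quality: spending more symbols always helps, but by less each time.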

Paper Structure (22 sections, 28 equations, 10 figures, 1 table, 2 algorithms)

Figures (10)

  • Figure 1: The workflow of the proposed embodied AI framework for vehicular networks. The framework comprises two key functions: semantic data processing using LLAVA for efficient information extraction and enhanced decision-making via GAE-PPO to optimize transmission and decision strategies.
  • Figure 2: The system model illustrates a cellular-based vehicular communication network, in which embodied AI vehicles use semantic communication to encode and decode structured messages for efficient and reliable data exchange [Shunpu_SemCom].
  • Figure 3: The architecture of the LLAVA model for semantic extraction and language embedding. The input image $I_i$ is processed by the visual encoder $g(\cdot)$ to generate feature vectors, which are then transformed through a projection matrix $W$ and processed by the LLAVA model to extract semantic information $M_i$. The BERT model $B(\cdot)$ is used to generate the final sentence-level representation, capturing key information from the visual input.
  • Figure 4: The workflow of the GAE-PPO method for optimizing transmission strategies. The actor and critic networks are updated iteratively through a combination of replay buffer sampling, temporal-difference (TD) error calculation, advantage estimation, and policy clipping. The actor network generates actions using a multivariate normal distribution, while the critic network evaluates state values, contributing to the stability and convergence of the learning process.
  • Figure 5: The illustration of semantic information extraction. In the left part, the environmental image is obtained through the onboard sensor. Then, the semantic information is extracted according to the user's needs via the embodied AI. In the right part, we demonstrate the relevance between the extracted semantic information and user needs, as well as the environmental images.
  • ...and 5 more figures
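
The TD-error, advantage-estimation, and policy-clipping steps named in the Figure 4 caption can be sketched as follows. This is a minimal pure-Python sketch of standard GAE and the standard PPO clipped surrogate, not the authors' implementation; the function names, the discount `gamma`, the GAE parameter `lam`, and the clip range `eps` are assumed values, and episode-termination handling is omitted.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory segment.

    rewards:    r_0 ... r_{T-1}
    values:     critic estimates V(s_0) ... V(s_{T-1})
    last_value: critic estimate V(s_T) used for bootstrapping
    """
    values = list(values) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean PPO clipped surrogate; ratios are pi_new(a|s) / pi_old(a|s)."""
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Taking the minimum removes any incentive to move the ratio
        # outside [1 - eps, 1 + eps] in the direction that helps.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


adv = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0)
print(adv, ppo_clip_objective([1.5, 0.9, 1.0], adv))
```

Setting `lam=0` reduces the estimate to the one-step TD error, while `lam=1` recovers the full discounted return minus the value baseline; intermediate values trade bias against variance.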

Theorems & Definitions (2)

  • Remark 1
  • Remark 2