StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving

Yuan Gao, Dengyuan Hua, Mattia Piccinini, Finn Rasmus Schäfer, Korbinian Moller, Lin Li, Johannes Betz

TL;DR

StyleVLA is a physics-informed VLA framework that generates diverse, physically plausible driving behaviors. Experiments show that a specialized, physics-informed, lightweight model can surpass closed-source models on domain-specific tasks.

Abstract

Vision Language Models (VLMs) bridge visual perception and linguistic reasoning. In Autonomous Driving (AD), this synergy has enabled Vision Language Action (VLA) models, which translate high-level multimodal understanding into driving behaviors, typically represented as future trajectories. However, existing VLA models mainly generate generic collision-free trajectories. Beyond collision avoidance, adapting to diverse driving styles (e.g., sporty, comfortable) is essential for personalized driving. Moreover, many methods treat trajectory generation as naive token prediction, which can produce kinematically infeasible actions. To address these limitations, we present StyleVLA, a physics-informed VLA framework for generating diverse and physically plausible driving behaviors. We introduce a hybrid loss that combines a kinematic consistency constraint with a continuous regression head to improve trajectory feasibility. To train StyleVLA, built on Qwen3-VL-4B, we construct a large-scale instruction dataset with over 1.2k scenarios, 76k Bird's Eye View (BEV) samples, and 42k First Person View (FPV) samples, with ground-truth trajectories for five driving styles and natural-language instructions. Experiments show that our 4B-parameter StyleVLA significantly outperforms proprietary models (e.g., Gemini-3-Pro) and state-of-the-art VLA models. Using a composite driving score measuring success rate, physical feasibility, and style adherence, StyleVLA achieves 0.55 on BEV and 0.51 on FPV, versus 0.32 and 0.35 for Gemini-3-Pro. These results show that a specialized, physics-informed, lightweight model can surpass closed-source models on domain-specific tasks.

Paper Structure (25 sections, 15 equations, 4 figures, 6 tables)


Figures (4)

  • Figure 1: Concept of StyleVLA: enabling driving style-aware trajectory generation via a VLA model. Our framework yields diverse driving styles (Default, Balanced, Comfort, Sporty, Safety) in response to user instructions.
  • Figure 2: Overview of the StyleVLA framework. Top (Dataset Construction): A motion planner generates style-specific ground-truth trajectories to create multimodal instruction samples. Instruction Dataset Generation: Details the instruction generation process and 3D scenario replay in CARLA. Bottom (Fine-tuning Architecture): The model predicts trajectory tokens using only an LLM head conditioned on visual context and language prompts. During training, an auxiliary MLP decoder maps the predicted tokens to continuous kinematic trajectories for physics-informed supervision. Training uses a physics-informed hybrid loss ($\mathcal{L}_\mathrm{total}$) combining cross-entropy ($\mathcal{L}_\mathrm{ce}$), regression ($\mathcal{L}_\mathrm{reg}$), and kinematic consistency ($\mathcal{L}_\mathrm{pikc}$). Trajectory Generation: Shows the model's application in both 2D BEV and 3D FPV domains.
  • Figure 3: Example training dynamics of StyleVLA fine-tuning on the instruction dataset. Top: loss terms ($\mathcal{L}_\mathrm{total}$, $\mathcal{L}_\mathrm{ce}$, $\mathcal{L}_\mathrm{reg}$, $\mathcal{L}_\mathrm{pikc}$). Bottom: learned log-variance parameters ($w_\mathrm{ce}$, $w_\mathrm{reg}$) that yield adaptive precision weights via $\exp(-w)$.
  • Figure 4: Qualitative comparison of style-conditioned trajectory generation under five driving styles (Default, Balanced, Comfort, Sporty, Safety). We visualize the goal, the ground truth, and the trajectories predicted by the pretrained and baseline models (see legend). "*" marks a model that failed to generate a trajectory.
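
The captions of Figures 2 and 3 name three loss terms and two learned log-variance parameters that act as precision weights via $\exp(-w)$. The exact form of $\mathcal{L}_\mathrm{total}$ is not given in this summary, so the following is only a minimal sketch under stated assumptions: the cross-entropy and regression terms are combined with the common uncertainty-weighting form $\exp(-w)\,\mathcal{L} + w$, the kinematic term enters with a fixed hypothetical weight `lambda_pikc`, and kinematic consistency is illustrated with a simple unicycle-model residual. The function names, the state layout, and the unicycle model are illustrative choices, not the paper's implementation.

```python
import math


def hybrid_loss(l_ce, l_reg, l_pikc, w_ce, w_reg, lambda_pikc=1.0):
    """Combine the three loss terms (assumed form, not taken from the paper).

    w_ce and w_reg are log-variance parameters: exp(-w) is the precision
    weight, and the additive +w term keeps the weight from collapsing to zero.
    lambda_pikc is a hypothetical fixed weight for the kinematic term.
    """
    weighted_ce = math.exp(-w_ce) * l_ce + w_ce
    weighted_reg = math.exp(-w_reg) * l_reg + w_reg
    return weighted_ce + weighted_reg + lambda_pikc * l_pikc


def kinematic_consistency(states, dt):
    """Mean squared gap between each next position and a unicycle-model step.

    states: list of (x, y, heading, speed) tuples (assumed layout).
    dt: time step between consecutive states, in seconds.
    """
    if len(states) < 2:
        return 0.0
    total = 0.0
    for (x0, y0, th0, v0), (x1, y1, _, _) in zip(states, states[1:]):
        # Position the unicycle model predicts for the next step.
        pred_x = x0 + v0 * math.cos(th0) * dt
        pred_y = y0 + v0 * math.sin(th0) * dt
        total += (x1 - pred_x) ** 2 + (y1 - pred_y) ** 2
    return total / (len(states) - 1)


# A straight-line trajectory at 2 m/s sampled every 0.5 s is consistent.
feasible = [(0.0, 0.0, 0.0, 2.0), (1.0, 0.0, 0.0, 2.0), (2.0, 0.0, 0.0, 2.0)]
# A 3 m jump in 0.5 s at 2 m/s is not reachable under the same model.
infeasible = [(0.0, 0.0, 0.0, 2.0), (3.0, 0.0, 0.0, 2.0)]

l_pikc_ok = kinematic_consistency(feasible, dt=0.5)
l_pikc_bad = kinematic_consistency(infeasible, dt=0.5)
total_ok = hybrid_loss(1.0, 2.0, l_pikc_ok, w_ce=0.0, w_reg=0.0)
total_bad = hybrid_loss(1.0, 2.0, l_pikc_bad, w_ce=0.0, w_reg=0.0)
```

Under these assumptions, a trajectory that violates the motion model raises the total loss even when the token-level and regression errors are unchanged, which is the intuition behind supervising the decoded trajectory in addition to the predicted tokens.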