E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu

TL;DR

This work defines Open-Domain End-to-End autonomous driving and introduces E3AD, an emotion-aware Vision-Language-Action model that jointly grounds language commands, infers continuous emotion via Valence-Arousal-Dominance, and reasons over egocentric and allocentric spatial representations to generate feasible trajectories. It couples three training stages—modality pretraining, joint fine-tuning, and emotion-action alignment with Direct Preference Optimization—and augments language with emotion-aware paraphrases to robustly tie emotional intent to planning. Across four real-world benchmarks, E3AD delivers state-of-the-art emotion estimation, improved visual grounding, and superior trajectory planning, with substantial end-to-end gains and favorable user-study feedback. By integrating emotion into grounding and planning, the approach yields more human-aligned behavior and greater passenger trust in autonomous driving systems.
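The emotion-action alignment stage relies on Direct Preference Optimization (DPO). As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. This summary does not describe how E3AD builds its preference pairs or sets its hyperparameters, so the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All four inputs are sequence log-probabilities: two from the policy being
    trained and two from a frozen reference model. `beta` is an illustrative
    default, not a value taken from the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # output over the rejected one, relative to the reference model.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical example: the policy has shifted probability toward the chosen
# trajectory, so the loss drops below log(2), its value at zero margin.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In an emotion-aware setting, one could imagine the chosen output being a trajectory consistent with the inferred emotion and the rejected output an inconsistent one, but that pairing is a guess rather than something stated here.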

Abstract

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.
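Emotion estimation is reported as a VAD correlation. The abstract does not say which correlation coefficient is used, so the sketch below assumes a per-dimension Pearson correlation between predicted and annotated (valence, arousal, dominance) triples. The helper names `pearson` and `vad_correlation` and the sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def vad_correlation(pred, gold):
    """Correlate predictions with annotations separately for each VAD dimension.

    `pred` and `gold` are lists of (valence, arousal, dominance) tuples.
    """
    names = ("valence", "arousal", "dominance")
    return {
        name: pearson([p[i] for p in pred], [g[i] for g in gold])
        for i, name in enumerate(names)
    }


# Made-up values for illustration; gold is an affine transform of pred,
# so every dimension correlates perfectly.
pred = [(0.1, 0.2, 0.3), (0.5, 0.4, 0.9), (0.9, 0.8, 0.1)]
gold = [(2 * v + 1, 2 * a + 1, 2 * d + 1) for v, a, d in pred]
scores = vad_correlation(pred, gold)
```

Reporting one score per dimension keeps valence, arousal, and dominance errors separate, which may or may not match how the paper aggregates them.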

Paper Structure

This paper contains 53 sections, 18 equations, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: Overview of our proposed E3AD framework, contrasted with conventional VLA pipelines. (a) Existing VLA models behave as emotion-agnostic systems, mapping multi-view images directly to a planning output without human-in-the-loop interaction or emotion understanding. (b) Our model adds explicit emotion modeling and closed-loop feedback, allowing the agent to infer intent intensity, ground referents more reliably, and adapt its plan accordingly. (c) This yields the Open-Domain E2E AD task, where the agent jointly reasons over language, emotion, perception, and navigation to enable human-centered and context-aware autonomy.
  • Figure 2: Overview of E3AD and its training/inference pipeline. Given egocentric and allocentric views with a natural-language command (a), E3AD outputs emotion, grounding, and waypoint tokens via two core modules: Emotion Modeling (b) encodes commands in continuous VAD space (c), and Spatial Reasoning fuses egocentric and allocentric pathway cues. Training proceeds from Modality Pretraining for emotion/spatial skills (d) to Joint Fine-Tuning that predicts ($\hat{e}$, $\hat{b}$, $\hat{\tau}$) in a single autoregressive chain (e), followed by Emotion-Action Alignment (f). During inference (g), E3AD runs end-to-end to estimate ($\hat{e}$), ground ($\hat{b}$), and plan ($\hat{\tau}$), producing human-centric feedback.
  • Figure 3: Visualization of emotion distributions before and after augmentation. (a) Proportions of GoEmotion categories across Talk2Car splits. (b) VAD distribution of GoEmotion. (c) Incorporating driving commands enriches emotional diversity. (d) Emotion-aware augmentation expands and smooths the VAD distribution, providing broader and continuous emotion supervision.
  • Figure 4: Qualitative comparison between E3AD and FSDrive-FT in emotion-rich (a), multi-agent (b), and ambiguous (c) scenes.
  • Figure 5: DPO’s effect on emotion-trajectory consistency.
  • ...and 7 more figures