
From Representational Complementarity to Dual Systems: Synergizing VLM and Vision-Only Backbones for End-to-End Driving

Sining Ang, Yuguang Yang, Chenxu Dang, Canyu Chen, Cheng Chi, Haiyan Liu, Xuanyao Mao, Jason Bao, Xuliang, Bingchuan Sun, Yan Wang

TL;DR

The paper analyzes representational and behavioral differences between full vision-language models (VLMs) and vision-only backbones within an end-to-end driving framework. It finds that VLMs add subspaces at the representation level, while policy learning compresses the heterogeneous backbone signals into a shared decision space. Complementarity between the VLM and vision-only policies is mainly long-tail and shows up as distinct driving styles, which can be exploited by trajectory-level selection rather than representation-only gating. The authors propose HybridDriveVLA, which runs both branches and selects between their trajectories (reaching 92.10 PDMS), and DualDriveVLA, which runs the vision-only fast path by default and falls back to the VLM only when needed (91.00 PDMS with 3.2x higher throughput). The work thus provides an analysis-to-mechanism pipeline, from representation geometry (RQ1) to trajectory-level selection (RQ3), for efficient and robust deployment of hybrid VLA driving systems.

Abstract

Vision-Language-Action (VLA) driving augments end-to-end (E2E) planning with language-enabled backbones, yet it remains unclear what changes beyond the usual accuracy--cost trade-off. We revisit this question with a three-RQ analysis in RecogDrive by instantiating the system with a full VLM and with vision-only backbones, all under an identical diffusion Transformer planner. RQ1: At the backbone level, the VLM can introduce additional subspaces beyond those of the vision-only backbones. RQ2: These unique subspaces lead to different behavior in some long-tail scenarios: the VLM tends to be more aggressive whereas the ViT is more conservative, and each decisively wins on about 2--3% of test scenarios. With an oracle that selects, per scenario, the better trajectory between the VLM and ViT branches, we obtain an upper bound of 93.58 PDMS. RQ3: To harness this observation, we propose HybridDriveVLA, which runs both the ViT and VLM branches and selects between their endpoint trajectories using a learned scorer, improving PDMS to 92.10. Finally, DualDriveVLA implements a practical fast--slow policy: it runs the ViT by default and invokes the VLM only when the scorer's confidence falls below a threshold; calling the VLM on 15% of scenarios achieves 91.00 PDMS while improving throughput by 3.2x. Code will be released.
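
The fast--slow routing described for DualDriveVLA can be illustrated with a minimal sketch. This is not the authors' code: the function name `dual_drive_step`, the callable signatures for the two policies and the scorer, and the choice to re-score both candidates on the slow path are all assumptions made for illustration. Only the overall rule (run the ViT branch first, invoke the VLM when the scorer's confidence falls below a threshold) comes from the abstract.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple

# A trajectory is assumed to be a sequence of (x, y) waypoints.
Trajectory = Sequence[Tuple[float, float]]


@dataclass
class RoutedPlan:
    trajectory: Trajectory
    used_slow_path: bool


def dual_drive_step(
    obs: Any,
    fast_policy: Callable[[Any], Trajectory],      # hypothetical ViT-branch planner
    slow_policy: Callable[[Any], Trajectory],      # hypothetical VLM-branch planner
    scorer: Callable[[Any, Trajectory], float],    # hypothetical learned scorer
    gamma: float,                                  # confidence threshold
) -> RoutedPlan:
    """Run the fast branch; fall back to the slow branch on low confidence."""
    fast_traj = fast_policy(obs)
    if scorer(obs, fast_traj) >= gamma:
        # Confident enough: keep the fast-path trajectory, skip the VLM.
        return RoutedPlan(fast_traj, used_slow_path=False)

    # Low confidence: also run the slow branch and keep the higher-scoring
    # candidate (assumed selection rule for this sketch).
    slow_traj = slow_policy(obs)
    best = max((fast_traj, slow_traj), key=lambda traj: scorer(obs, traj))
    return RoutedPlan(best, used_slow_path=True)
```

Raising `gamma` in this sketch routes more scenarios to the slow path, which trades throughput for the chance to pick the VLM trajectory.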

Paper Structure (99 sections, 65 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Shared vs. model-specific representation geometry after alignment. Each model's features are aligned to the ResNet-101 feature space (used as the reference) via orthogonal Procrustes alignment (Schönemann, 1966), then visualized using a 2D PCA projection fitted on the concatenation of all aligned features. For each model, the curves are KDE probability-mass contours enclosing the smallest regions containing 60/70/80% of the KDE mass; the shaded region is a robust 95% covariance ellipse (MinCovDet), and the star denotes the robust mean. Vision-only backbones form a tightly overlapping cluster, while the VLM shares a substantial core with them yet also occupies additional regions, indicating a mixture of shared and model-specific subspaces rather than strict containment.
  • Figure 2: Backbone vs. DiT representation similarity measured by CKA. We compute pairwise linear CKA between feature representations from different model branches and visualize the resulting similarity matrices for (left) the visual backbone features and (right) the DiT features. Backbone representations show high agreement among vision-only encoders (e.g., ViT/ResNet/EVA-CLIP) but low similarity to the VLM branch. In contrast, the DiT representations become markedly more aligned across all branches, with substantially increased VLM-to-vision similarity (e.g., $\sim$0.21--0.22 $\rightarrow$ $\sim$0.50--0.54), indicating that downstream planning modules compress heterogeneous visual representations into a more shared, decision-relevant space.
  • Figure 3: Overview of our dual-branch RecogDrive system and analysis points. A VLM branch (as in the original RecogDrive) and a vision-only branch (ViT/ResNet/EVA-CLIP) provide alternative visual representations to a diffusion Transformer planner (DiT) and action decoder. The two branches use the same planner architecture but are instantiated as separate policies (no weight sharing), producing two candidate trajectories with distinct behaviors. We expand candidates by interpolating between the two trajectories and select the final output using a learned scorer. Our analyses (e.g., Fig. 1 and Fig. 2) probe representations at the backbone feature and the DiT (decision) feature indicated in the figure.
  • Figure 4: DualDriveVLA accuracy--compute trade-off by varying the confidence threshold $\gamma$. The x-axis shows the fraction of scenarios routed to the fast path (ViT-only). The left y-axis reports the overall PDMS score, and the right y-axis reports inference speed (throughput / latency, as measured in our setup). Higher ViT selection ratio increases speed, while the scorer-based fallback preserves performance by invoking the slow VLM+selection path on low-confidence cases.
  • Figure 5: CCA canonical-correlation spectra (PCA truncation + whitening).
  • ...and 3 more figures
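
The two representation-analysis tools named in the captions of Figures 1 and 2, orthogonal Procrustes alignment and linear CKA, can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' pipeline: the function names are hypothetical, features are assumed to be stored as (samples x dimensions) matrices, and the Procrustes step assumes both feature matrices already share the same dimensionality (for example after a PCA truncation).

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (n_samples, dim) feature matrices."""
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))


def procrustes_align(source: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Rotate centered `source` features onto centered `reference` features.

    Solves min_R ||source @ R - reference||_F over orthogonal R.
    Assumes both matrices have the same shape (n_samples, dim).
    """
    src = source - source.mean(axis=0, keepdims=True)
    ref = reference - reference.mean(axis=0, keepdims=True)
    # The optimal rotation is U @ V^T, where src^T ref = U S V^T.
    u, _, vt = np.linalg.svd(src.T @ ref)
    return src @ (u @ vt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats_a = rng.normal(size=(200, 16))   # stand-in for reference features
    feats_b = rng.normal(size=(200, 16))   # stand-in for another backbone
    aligned_b = procrustes_align(feats_b, feats_a)
    print("CKA(a, b):", linear_cka(feats_a, feats_b))
    print("aligned shape:", aligned_b.shape)
```

Linear CKA is invariant to orthogonal transformations of either feature matrix, so the Procrustes rotation changes how features look in a shared plot but not their CKA similarity.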