Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

Fatemeh Naeinian, Ali Hamza, Haoran Zhu, Anna Choromanska

Abstract

End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real domain shifts to new locations. In this work, we investigate zero-shot cross-city generalization in end-to-end trajectory planning and ask whether self-supervised visual representations improve transfer across cities. We conduct a comprehensive study by integrating self-supervised backbones (I-JEPA, DINOv2, and MAE) into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM under the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when models relying on traditional supervised backbones are transferred across cities with different road topologies and driving conventions, particularly from right-side to left-side driving environments. Self-supervised representation learning reduces this gap. In open-loop evaluation, a supervised backbone exhibits severe error inflation when transferring from Boston to Singapore (L2 displacement ratio 9.77x, collision ratio 19.43x), whereas domain-specific self-supervised pretraining reduces these ratios to 1.20x and 0.75x, respectively. In closed-loop evaluation, self-supervised pretraining improves PDMS by up to 4 percent for every single-city training configuration. These results show that representation learning strongly influences the robustness of cross-city planning and establish zero-shot geographic transfer as a necessary test for evaluating end-to-end autonomous driving systems.
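The strict geographic split described in the abstract can be illustrated with a minimal sketch. It assumes each scene record carries a nuScenes-style `location` string such as `boston-seaport` or `singapore-onenorth`, and that the city is the part before the first hyphen. The function name `split_by_city`, the record layout, and the scene names are hypothetical and are not taken from the paper's code.

```python
from collections import defaultdict


def split_by_city(scenes, train_city):
    """Split scenes into a single-city training set and held-out cities.

    Assumption: scene["location"] looks like "<city>-<district>".
    Returns (train_scenes, held_out), where held_out maps every other
    city to its scenes, so no training city leaks into evaluation.
    """
    train_scenes = []
    held_out = defaultdict(list)
    for scene in scenes:
        city = scene["location"].split("-")[0]
        if city == train_city:
            train_scenes.append(scene)
        else:
            held_out[city].append(scene)
    return train_scenes, dict(held_out)


# Hypothetical scene records, for illustration only.
scenes = [
    {"name": "scene-a", "location": "boston-seaport"},
    {"name": "scene-b", "location": "singapore-onenorth"},
    {"name": "scene-c", "location": "singapore-queenstown"},
    {"name": "scene-d", "location": "boston-seaport"},
]

train, held_out = split_by_city(scenes, train_city="boston")
print(len(train), sorted(held_out))
```

Training on `train` and evaluating only on `held_out` corresponds to the zero-shot setting; a geographically mixed split would instead sample both sets from all cities.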

Paper Structure (40 sections, 3 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Cross-city transfer protocol. A model trained on one city is evaluated zero-shot on a different city. Geographic domain shift leads to increased $L2_{\text{avg}}$ (trajectory displacement error) and higher collision rate, reflecting degraded driving performance under transfer.
  • Figure 2: Overview of the proposed evaluation framework. Top: backbone pretraining paradigms, including supervised and self-supervised methods (I-JEPA, DINOv2, MAE). Bottom: backbone integration into the LAW latent world model and the TransFuser model for end-to-end trajectory prediction.
  • Figure 3: In-domain vs. zero-shot cross-city performance on nuScenes. Circles denote in-domain results and triangles denote cross-city transfer. Lines connect each model’s in-domain and cross-domain performance in $L2_{\text{avg}}$ and collision rate. Shorter lines indicate more robust cross-city transfer.
  • Figure 4: Average OOD PDMS by training city on NAVSIM. For each city, TransFuser (left) and Latent TransFuser (right) are shown side by side. Each panel reports the mean PDMS across the three held-out cities.
  • Figure 5: Closed-loop trajectory comparison in NAVSIM (Las Vegas). Ground truth (green) is overlaid with predictions from ResNet-34 (blue), I-JEPA (orange), DINOv2 (purple), and MAE (red) pretrained on nuScenes. All models follow the overall turn structure, with minor differences in lateral alignment and curvature reflecting representation-dependent behavior under cross-city transfer.
  • ...and 1 more figures
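
The in-domain versus cross-city comparison in Figures 1 and 3, and the ratios quoted in the abstract, can be read as a simple inflation factor. The sketch below assumes the ratio is the cross-city metric divided by the in-domain metric for the same model; the paper's exact definition may differ. The function name `transfer_ratio` and the input values are hypothetical placeholders, not results from the paper.

```python
def transfer_ratio(cross_city, in_domain):
    """Return how much a metric inflates under cross-city transfer.

    Assumed definition: cross-city value divided by in-domain value.
    A ratio above 1 means the metric got worse on the unseen city
    (for error-type metrics such as L2 or collision rate); a ratio
    below 1 means the cross-city value is lower than in-domain.
    """
    if in_domain <= 0:
        raise ValueError("in-domain metric must be positive")
    return cross_city / in_domain


# Placeholder values for illustration only (not from the paper).
l2_ratio = transfer_ratio(cross_city=1.5, in_domain=0.5)
collision_ratio = transfer_ratio(cross_city=0.5, in_domain=0.25)
print(f"L2 ratio: {l2_ratio:.2f}x, collision ratio: {collision_ratio:.2f}x")
```

Under this reading, the reported 0.75x collision ratio would indicate a cross-city collision rate below the in-domain rate for that model.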