Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos

Matthew Strong, Wei-Jer Chang, Quentin Herau, Jiezhi Yang, Yihan Hu, Chensheng Peng, Wei Zhan

TL;DR

LFG is a feedforward architecture with a lightweight autoregressive module, trained with multi-modal supervisory signals to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. These capabilities position it as a compelling video-centric foundation model for autonomous driving.

Abstract

Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and qualitative motion prediction tasks. These geometry- and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.

Paper Structure (38 sections, 10 equations, 16 figures, 8 tables)

Figures (16)

  • Figure 1: LFG learns a unified pseudo-4D representation of geometry, semantics, motion, and short-term future evolution directly from unposed, unlabeled single-view driving videos. A single feedforward encoder processes observed frames and produces temporally consistent predictions of 3D point maps, camera poses, semantic layouts, confidence, and motion masks for both current and future frames.
  • Figure 2: LFG architecture. Starting from unposed single-view driving clips, a pretrained $\pi^3$ backbone encodes $N$ observed frames into latent scene tokens. A lightweight causal autoregressive transformer rolls out $M$ future tokens, which a shared decoder maps to point maps, camera poses, semantic segmentation, confidence maps, and motion masks for all $N{+}M$ frames. Multi-modal teachers provide pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation that transfers effectively to downstream planning.
  • Figure 3: $\pi^3$-to-LFG distillation. We transfer geometric knowledge from the pretrained $\pi^3$ teacher to LFG by supervising point maps, confidence maps, and camera poses for all observed and future frames. While the teacher has access to the full sequence, the student sees only the first $N$ frames and must predict both current and future geometry, enabling LFG to learn temporally consistent scene structure and future ego-motion from partial observations.
  • Figure 4: Semantic distillation. A pretrained SegFormer teacher, trained on Cityscapes, provides soft semantic pseudo-labels for each frame. LFG predicts semantic maps for both observed and future frames using only the first $N$ inputs, learning temporally consistent scene semantics through teacher–student supervision aligned with the model's geometric features.
  • Figure 5: Motion mask generation pipeline. We first detect human and vehicle instances in the first frame using Grounded SAM2, then track their 2D trajectories across time with CoTracker3. Using teacher $\pi^3$ point maps, tracked pixels are backprojected into 3D and per-instance 3D displacements are measured over the sequence. Instances whose motion exceeds a threshold for at least $K_{\min}$ frames are labeled as dynamic, and their masks are rasterized into dense per-pixel motion masks $\mathbf{M}_t$, which supervise the motion head.
  • ...and 11 more figures
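
The dynamic-instance labeling rule described in the Figure 5 caption can be illustrated with a short NumPy sketch. This is not the authors' implementation. It assumes that the tracked pixels have already been backprojected into a common world frame (the detection, tracking, and backprojection steps are not shown), that displacement is measured relative to the first frame, and that the per-instance displacement is the median over tracked points. The function names `label_dynamic_instances` and `rasterize_motion_mask`, the parameter `motion_threshold`, and all numeric values are hypothetical; only $K_{\min}$ (here `k_min`) and the mask $\mathbf{M}_t$ come from the caption.

```python
import numpy as np


def label_dynamic_instances(tracks_3d, motion_threshold=0.5, k_min=3):
    """Label each instance as dynamic (True) or static (False).

    tracks_3d: dict mapping instance id -> array of shape (T, P, 3) holding
        P tracked 3D points over T frames, assumed to be in one world frame.
    motion_threshold: hypothetical displacement threshold, in point-map units.
    k_min: minimum number of frames whose displacement must exceed the threshold.
    """
    labels = {}
    for inst_id, pts in tracks_3d.items():
        pts = np.asarray(pts, dtype=float)
        # Displacement of every tracked point relative to the first frame: (T, P).
        disp = np.linalg.norm(pts - pts[0:1], axis=-1)
        # Assumption: summarize each frame by the median over the instance's points.
        per_frame = np.median(disp, axis=1)
        # Dynamic if the threshold is exceeded in at least k_min frames.
        labels[inst_id] = int((per_frame > motion_threshold).sum()) >= k_min
    return labels


def rasterize_motion_mask(instance_masks, labels, shape):
    """Union the 2D masks of dynamic instances into one dense (H, W) mask."""
    mask = np.zeros(shape, dtype=bool)
    for inst_id, inst_mask in instance_masks.items():
        if labels.get(inst_id, False):
            mask |= np.asarray(inst_mask, dtype=bool)
    return mask


# Toy data: one parked car and one car translating 0.4 units per frame along x.
T, P = 6, 5
rng = np.random.default_rng(0)
base = rng.uniform(-1.0, 1.0, size=(P, 3))
parked = np.repeat(base[None], T, axis=0)
moving = parked + np.arange(T)[:, None, None] * np.array([0.4, 0.0, 0.0])

labels = label_dynamic_instances({"parked_car": parked, "moving_car": moving})

# Toy per-instance segmentation masks for a single 4x6 frame.
H, W = 4, 6
parked_mask = np.zeros((H, W), dtype=bool)
parked_mask[0:2, 0:2] = True
moving_mask = np.zeros((H, W), dtype=bool)
moving_mask[2:4, 3:6] = True

motion_mask = rasterize_motion_mask(
    {"parked_car": parked_mask, "moving_car": moving_mask}, labels, (H, W)
)
print(labels)
print(motion_mask.astype(int))
```

In this toy setup the moving car exceeds the threshold in four of six frames, so it is labeled dynamic with `k_min=3`, while the parked car never moves and stays out of the mask. Requiring several frames above the threshold, rather than one, is what makes the rule tolerant of isolated tracking or depth errors.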