StreetForward: Perceiving Dynamic Street with Feedforward Causal Attention

Zhongrui Yu, Zhao Wang, Yijia Xie, Yida Wang, Xueyang Zhang, Yifei Zhan, Kun Zhan

Abstract

Feedforward reconstruction is crucial for autonomous driving applications, where rapid scene reconstruction enables efficient utilization of large-scale driving datasets in closed-loop simulation and other downstream tasks, eliminating the need for time-consuming per-scene optimization. We present StreetForward, a pose-free and tracker-free feedforward framework for dynamic street reconstruction. Building upon the alternating attention mechanism from Visual Geometry Grounded Transformer (VGGT), we propose a simple yet effective temporal mask attention module that captures dynamic motion information from image sequences and produces motion-aware latent representations. Static content and dynamic instances are represented uniformly with 3D Gaussian Splatting, and are optimized jointly by cross-frame rendering with spatio-temporal consistency, allowing the model to infer per-pixel velocities and produce high-fidelity novel views at new poses and times. We train and evaluate our model on the Waymo Open Dataset, demonstrating superior performance on novel view synthesis and depth estimation compared to existing methods. Furthermore, zero-shot inference on CARLA and other datasets validates the generalization capability of our approach. More visualizations are available on our project page: https://streetforward.github.io.
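To make the idea of temporally masked attention concrete, here is a minimal sketch in plain Python. It assumes the mask is causal at the frame level: a token may attend to tokens from its own frame and from earlier frames, but not from later ones. The abstract does not specify the exact mask layout, so this rule, and the names `frame_causal_mask` and `masked_softmax`, are illustrative assumptions rather than the authors' implementation.

```python
import math


def frame_causal_mask(num_frames, tokens_per_frame):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumption (not from the paper): causality is applied per frame, so a
    token sees every token of its own frame and of all earlier frames.
    """
    n = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(n)]
    return [[frame_of[k] <= frame_of[q] for k in range(n)] for q in range(n)]


def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, ignoring disallowed keys."""
    kept = [s for s, a in zip(scores, allowed) if a]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Three frames with two tokens each; uniform scores for a first-frame token.
mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
weights = masked_softmax([0.0] * 6, mask[0])
```

With uniform scores, a first-frame token spreads its attention only over the two tokens of its own frame, while a last-frame token can attend to all six.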

Paper Structure (29 sections, 21 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: The proposed StreetForward pipeline. We illustrate two common types of dynamic scenes, with rigid objects (vehicles) and deformable objects (pedestrians). The input video is first encoded into per-frame patchified features and then processed by $L$ rounds of alternating global and frame attention to aggregate information across frames. These aggregated features are directly decoded by a camera head, a depth head, and a Gaussian head to obtain poses, depth, and Gaussian attributes. Causal masked attention is then introduced to form motion-aware features, which are used to estimate both forward and backward motion as well as a dynamic mask for separating static and dynamic Gaussians. The final 4D scene is obtained by combining static Gaussians with dynamic Gaussians propagated across time using the predicted motion.
  • Figure 2: Robustness to false-positive motion-mask prompts. Our method correctly assigns near-zero dynamic probability to parked or slowly moving vehicles, even if they are labeled as dynamic in annotations. This accurate understanding of motion (c) eliminates the motion ambiguity seen in DGGT (a), resulting in stable and clear rendering of the target vehicle over time (d).
  • Figure 2: Waymo original-view (interpolation) synthesis in PSNR and SSIM. Best and second-best results per column are indicated by colored boxes. Values for STORM and DGGT are transcribed from the cited papers.
  • Figure 3: Enforcing rigid regularization ($\mathcal{L}_\text{rigid}$) removes the structural floaters seen around rigid objects in (a).
  • Figure 4: Temporal interpolation. Given observations at times $t-1$ and $t+1$, our model correctly synthesizes the view at time $t$. The figure compares the results of our full model against DGGT [chen2025dggt] and against an ablation without backward fusion, which predicts forward velocity only. The left three columns show pedestrians with correct poses, and the right three columns show vehicle geometry reconstruction, where our model delivers complete geometries.
  • ...and 3 more figures
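
The captions for Figures 1 and 4 describe propagating dynamic Gaussians across time with predicted forward and backward motion. The sketch below illustrates that step under loudly stated assumptions: motion is treated as linear over the time gap, a Gaussian counts as dynamic when its predicted probability exceeds 0.5, and fusion is a simple union of the two propagated sets. None of these choices, nor the names `propagate` and `fuse_at_midpoint`, are confirmed by the source.

```python
def propagate(centers, velocities, dyn_prob, dt, threshold=0.5):
    """Move dynamic Gaussian centers by velocity * dt; keep static ones fixed.

    Assumptions (hypothetical): linear motion over dt and a fixed
    probability threshold for separating static from dynamic Gaussians.
    """
    moved = []
    for center, velocity, prob in zip(centers, velocities, dyn_prob):
        if prob > threshold:
            moved.append(tuple(c + v * dt for c, v in zip(center, velocity)))
        else:
            moved.append(tuple(center))
    return moved


def fuse_at_midpoint(prev_frame, next_frame):
    """Build the Gaussian centers at time t from frames t-1 and t+1.

    prev_frame uses forward velocities and next_frame uses backward
    velocities, each as a (centers, velocities, dyn_prob) triple. Fusion
    here is a plain union of both propagated sets.
    """
    from_past = propagate(*prev_frame, dt=1.0)
    from_future = propagate(*next_frame, dt=1.0)
    return from_past + from_future


# One static and one dynamic Gaussian observed at t-1 and again at t+1.
prev_frame = ([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
              [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
              [0.1, 0.9])
next_frame = ([(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
              [(0.0, 0.0, 0.0), (-2.0, 0.0, 0.0)],
              [0.1, 0.9])
centers_at_t = fuse_at_midpoint(prev_frame, next_frame)
```

In this toy case the dynamic Gaussian lands at the same position whether it is pushed forward from $t-1$ or pulled backward from $t+1$, which is the kind of agreement that fusing both directions relies on; dropping the backward set would leave only the Gaussians visible at $t-1$.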