TransParking: A Dual-Decoder Transformer Framework with Soft Localization for End-to-End Automatic Parking

Hangyu Du, Chee-Meng Chew

TL;DR

This work tackles end-to-end automatic parking using a purely vision-based Transformer with a dual-decoder architecture that predicts $x_t$ and $y_t$ simultaneously. It introduces a Gaussian soft query over the BEV feature map and a Soft Localization head, enabling cross-stream coupling via Dual-Stream Interactive Self-Attention to refine each trajectory step with local BEV context. On a ParkingE2E-origin dataset, the proposed method achieves substantially lower Hausdorff, L2, and Fourier distances (e.g., $0.2156$, $0.06296$, $1.239$) compared with ParkingE2E and TransFuser, signaling improved accuracy and stability. The approach demonstrates robust end-to-end trajectory prediction for parking tasks, though limitations include dataset scale and the absence of real-road tests; future work points to larger datasets, more complex scenarios, and real-vehicle validation, potentially incorporating reinforcement learning to focus attention adaptively.
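Two of the trajectory metrics named above can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function names, the choice to average the L2 error over waypoints, and the toy trajectories are assumptions made for illustration, and the Fourier distance is omitted because its exact definition is not given here.

```python
import math

def l2_distance(pred, gt):
    """Mean Euclidean distance between matched waypoints (averaging is an assumption)."""
    if len(pred) != len(gt):
        raise ValueError("trajectories must have the same number of waypoints")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff_distance(pred, gt):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(pred, gt), directed_hausdorff(gt, pred))

# Hypothetical toy trajectories as (x, y) waypoints, not data from the paper.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.0)]

print(l2_distance(pred, gt))         # mean per-waypoint error
print(hausdorff_distance(pred, gt))  # worst-case deviation between the two curves
```

The L2 distance compares waypoints with the same index, so it penalizes timing errors along the path, whereas the Hausdorff distance ignores ordering and reports only the worst geometric deviation between the two curves.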

Abstract

In recent years, fully differentiable end-to-end autonomous driving systems have become a research hotspot in the field of intelligent transportation. Among the various research directions, automatic parking is particularly critical because it requires parking a vehicle precisely in complex environments. In this paper, we present a purely vision-based transformer model for end-to-end automatic parking, trained on expert trajectories. Given camera-captured data as input, the proposed model directly outputs future trajectory coordinates. Experimental results show that our model reduces the evaluated error metrics by approximately 50% compared with the current state-of-the-art end-to-end trajectory prediction algorithm of the same type. Our approach thus provides an effective solution for fully differentiable automatic parking.

Paper Structure

This paper contains 15 sections, 13 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overall architecture. Given the user-selected slot and multi-camera images, we form a Gaussian soft query and build BEV features via EfficientNet+LSS. Dual decoders predict $(x_t,y_t)$ with Dual-Stream Interactive Self-Attention, and a Soft Localization head refines each step using local BEV evidence before outputting the final trajectory (used by the controller).
  • Figure 2: The specific implementation of the Encoder structure
  • Figure 3: The implementation logic of the Dual-Stream Interactive Self-Attention structure. The feature vectors of the X and Y coordinates are concatenated at each time step, so each time step contributes two vectors to the attention sequence. After the attention calculation is completed, the output is reshaped to its original size. The batch dimension is omitted from the figure.
  • Figure 4: The structure of the Soft Localization Coordinate Refinement Head. The BEV feature is the output of LSS. The hidden state vectors of the X and Y decoders jointly predict an attention weight map over the likely region of the next trajectory point; this map is multiplied by the BEV feature to produce a weighted BEV feature map. The hidden state vectors then interact with the weighted map through cross-attention to refine the predicted coordinates.
  • Figure 5: The inference results in indoor and outdoor scenarios. The orange line represents the inference trajectory of the method proposed in this paper, and the purple line represents that of ParkingE2E (Li et al., 2024).
  • ...and 2 more figures
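
The token layout described in the Figure 3 caption can be sketched as follows. This is one possible reading, not the authors' implementation: the interleaving order `[x_1, y_1, x_2, y_2, ...]`, the single attention head, the identity query/key/value projections, the absence of a causal mask, and all function names are assumptions made to keep the example small.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections (assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def dual_stream_attention(x_feats, y_feats):
    """Merge the X and Y streams, attend jointly, then split back into two streams."""
    if len(x_feats) != len(y_feats):
        raise ValueError("both streams need the same number of time steps")
    # Two vectors per time step: [x_1, y_1, x_2, y_2, ...] (assumed ordering).
    merged = [tok for pair in zip(x_feats, y_feats) for tok in pair]
    attended = self_attention(merged)
    # Restore the original per-stream layout.
    return attended[0::2], attended[1::2]

# Hypothetical features: 3 time steps, feature dimension 2.
x_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_feats = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.0]]
x_out, y_out = dual_stream_attention(x_feats, y_feats)
print(x_out)
print(y_out)
```

Because both streams share one attention sequence, every output vector in the Y stream depends on the X features and vice versa, which is the cross-stream coupling the caption describes.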