TransParking: A Dual-Decoder Transformer Framework with Soft Localization for End-to-End Automatic Parking
Hangyu Du, Chee-Meng Chew
TL;DR
This work tackles end-to-end automatic parking with a purely vision-based Transformer whose dual-decoder architecture predicts the trajectory coordinates $x_t$ and $y_t$ simultaneously. It introduces a Gaussian soft query over the BEV feature map and a Soft Localization head, and couples the two decoder streams through Dual-Stream Interactive Self-Attention so that each trajectory step is refined with local BEV context. On a ParkingE2E-origin dataset, the proposed method achieves substantially lower Hausdorff, L2, and Fourier distances (e.g., $0.2156$, $0.06296$, $1.239$) than ParkingE2E and TransFuser, indicating improved accuracy and stability. The approach demonstrates robust end-to-end trajectory prediction for parking tasks. Its limitations include the dataset scale and the absence of real-road tests. Future work points to larger datasets, more complex scenarios, and real-vehicle validation, potentially incorporating reinforcement learning to focus attention adaptively.
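To make the idea of a Gaussian soft query more concrete, the following is a minimal sketch of Gaussian-weighted pooling over a BEV feature map. It is an illustration under stated assumptions, not the authors' implementation: the function name `gaussian_soft_query`, the `(C, H, W)` feature layout, the grid-cell coordinate convention, and the fixed `sigma` are all hypothetical choices that do not come from the summary above.

```python
import numpy as np


def gaussian_soft_query(bev, center_xy, sigma=2.0):
    """Pool a (C, H, W) BEV feature map with a Gaussian window.

    Assumptions (not from the paper): center_xy is (x, y) in grid-cell
    units, where x indexes columns and y indexes rows, and sigma is a
    fixed standard deviation in cells.
    Returns the (C,) pooled feature and the (H, W) normalised weights.
    """
    _, h, w = bev.shape
    # Row (y) and column (x) index of every BEV cell.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Squared distance of each cell from the query centre.
    d2 = (xs - center_xy[0]) ** 2 + (ys - center_xy[1]) ** 2
    weights = np.exp(-d2 / (2.0 * sigma ** 2))
    weights /= weights.sum()  # normalise so the weights sum to 1
    # Weighted sum over the spatial dimensions, one value per channel.
    pooled = (bev * weights[None]).sum(axis=(1, 2))
    return pooled, weights


# Usage: query a random 8-channel, 32x32 BEV map around cell (x=10, y=20).
rng = np.random.default_rng(0)
feature, w = gaussian_soft_query(rng.normal(size=(8, 32, 32)), (10.0, 20.0))
```

The soft window keeps the operation differentiable with respect to the centre, which is one plausible reason to prefer it over a hard crop in an end-to-end model.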
Abstract
In recent years, fully differentiable end-to-end autonomous driving systems have become an active research topic in intelligent transportation. Among the various research directions, automatic parking is particularly critical because it aims to enable precise vehicle parking in complex environments. In this paper, we present a purely vision-based Transformer model for end-to-end automatic parking, trained on expert trajectories. Given camera-captured data as input, the proposed model directly outputs future trajectory coordinates. Experimental results show that our model reduces the evaluated error metrics by approximately 50% compared with the current state-of-the-art end-to-end trajectory prediction algorithm of the same type. Our approach thus provides an effective solution for fully differentiable automatic parking.
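As a rough illustration of how trajectory errors of this kind can be measured, the sketch below computes a mean L2 distance and a symmetric Hausdorff distance between a predicted and an expert trajectory. The function names and the exact definitions (mean over time steps, symmetric Hausdorff over point sets) are assumptions for illustration; the paper's own metric definitions may differ, and the Fourier distance is omitted because its definition is not given here.

```python
import numpy as np


def mean_l2(pred, expert):
    """Mean point-wise Euclidean distance between two (T, 2) trajectories."""
    return float(np.linalg.norm(pred - expert, axis=1).mean())


def hausdorff(pred, expert):
    """Symmetric Hausdorff distance between (N, 2) and (M, 2) point sets."""
    # Pairwise distances: d[i, j] = ||pred[i] - expert[j]||.
    d = np.linalg.norm(pred[:, None, :] - expert[None, :, :], axis=2)
    # Largest nearest-neighbour distance, taken in both directions.
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))


# Usage: a prediction offset laterally by 0.5 from a straight expert path.
expert = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = expert + np.array([0.0, 0.5])
l2_error = mean_l2(pred, expert)
hd_error = hausdorff(pred, expert)
```

The mean L2 distance compares points at matching time steps, whereas the Hausdorff distance compares the two paths as shapes and is therefore insensitive to timing differences along the path.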
