End-to-End Visual Autonomous Parking via Control-Aided Attention

Chao Chen, Shunyu Yao, Yuanwu He, Feng Tao, Ruojing Song, Yuliang Guo, Xinyu Huang, Chenxu Wu, Liu Ren, Chen Feng

TL;DR

This work introduces CAA-Policy, an end-to-end visual autonomous parking framework that tightly couples perception and control through a novel Control-Aided Attention (CAA) mechanism, which uses control-gradient signals to guide attention toward control-relevant regions. It augments the perception backbone with a Target Tokenization Module and a Learnable Motion Prediction module, and adds a short-horizon waypoint prediction task to improve temporal consistency. A unified multi-task loss including a Grad-CAM–inspired CAA loss aligns perception with downstream control, enabling robust performance in CARLA that surpasses both end-to-end and modular baselines, while maintaining interpretability. The results demonstrate substantial gains in trajectory accuracy, target tracking, and failure-rate reduction, highlighting the practical value of integrating target-aware perception, motion reasoning, and attention guidance for precise parking in dynamic environments.

Abstract

Precise parking requires an end-to-end system where perception adaptively provides policy-relevant details, especially in critical areas where fine control decisions are essential. End-to-end learning offers a unified framework by directly mapping sensor inputs to control actions, but existing approaches lack effective synergy between perception and control. To address this, we propose CAA-Policy, an end-to-end imitation learning system that allows control signals to guide the learning of visual attention via a novel Control-Aided Attention (CAA) mechanism. We train this attention module in a self-supervised manner, using gradients backpropagated from the control outputs instead of from the training loss. This strategy encourages attention to focus on visual features that induce high variance in action outputs rather than on features that merely minimize the training loss, a shift that we demonstrate leads to a more robust and generalizable policy. To further strengthen the framework, CAA-Policy incorporates short-horizon waypoint prediction as an auxiliary task to improve the temporal consistency of control outputs, a learnable motion prediction module to robustly track target slots over time, and a modified target tokenization scheme for more effective feature fusion. Extensive experiments in the CARLA simulator show that CAA-Policy consistently surpasses both the end-to-end learning baseline and the modular BEV segmentation + hybrid A* pipeline, achieving superior accuracy, robustness, and interpretability. Code is released at https://github.com/ai4ce/CAAPolicy, and the collected training datasets will be released.
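To make the idea of "gradients backpropagated from the control outputs" concrete, the following is a minimal Grad-CAM-style sketch in plain Python. It is not the paper's CAA module or loss: the toy `control_head`, the weights `w`, and the tiny 2x2 feature map are all hypothetical, and a finite-difference gradient stands in for real backpropagation. It only illustrates how a saliency map can be derived from the sensitivity of a control output to visual features, rather than from a training loss.

```python
import math

def control_head(feat, w):
    """Hypothetical toy control head: tanh of a weighted sum of channel means."""
    z = 0.0
    for c, channel in enumerate(feat):
        n = len(channel) * len(channel[0])
        z += w[c] * sum(sum(row) for row in channel) / n
    return math.tanh(z)

def control_gradients(feat, w, eps=1e-5):
    """Central finite-difference gradient of the control output w.r.t. each feature cell.
    (A stand-in for backpropagation; assumption for illustration only.)"""
    grads = [[[0.0] * len(row) for row in ch] for ch in feat]
    for c, ch in enumerate(feat):
        for i, row in enumerate(ch):
            for j, v in enumerate(row):
                feat[c][i][j] = v + eps
                up = control_head(feat, w)
                feat[c][i][j] = v - eps
                down = control_head(feat, w)
                feat[c][i][j] = v  # restore the original value
                grads[c][i][j] = (up - down) / (2 * eps)
    return grads

def control_saliency(feat, w):
    """Grad-CAM-style map: weight each channel by its spatially averaged control
    gradient, sum over channels, apply ReLU, and normalize by the peak value."""
    grads = control_gradients(feat, w)
    height, width = len(feat[0]), len(feat[0][0])
    alphas = [sum(sum(r) for r in g) / (height * width) for g in grads]
    cam = [[max(0.0, sum(alphas[c] * feat[c][i][j] for c in range(len(feat))))
            for j in range(width)] for i in range(height)]
    peak = max(max(r) for r in cam)
    if peak > 0:
        cam = [[v / peak for v in r] for r in cam]
    return cam

# Two channels on a 2x2 grid: channel 0 fires at (0, 1), channel 1 at (1, 0).
features = [
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [1.0, 0.0]],
]
weights = [2.0, -1.0]  # hypothetical head weights
cam = control_saliency(features, weights)
```

In this toy setup, the cell where the positively weighted channel fires receives the peak saliency, while the cell driven by the negatively weighted channel is suppressed by the ReLU. In a real network the gradients would come from automatic differentiation, and a map of this kind could serve as a supervision signal for a learned attention module.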

Paper Structure

This paper contains 35 sections, 8 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 2: Decoder self-attention maps for target tracking at two consecutive frames ($t$ and $t{+}1$). (a) Without CAA, self-attention is scattered over irrelevant areas; (b) with a modified CBAM [woo2018cbam], an attention module learned automatically from task supervision alone (no auxiliary loss), attention is only partially focused; (c) with CAA, self-attention is concentrated on the target and its vicinity; (d) ground-truth segmentation (yellow: target, blue: obstacles) for reference.
  • Figure 3: CAA-Policy consists of five main components: (1) Perception Backbone, (2) Feature Fusion, (3) Learnable Motion Prediction module, (4) Control-Aided Attention (CAA) module, and (5) Control and Waypoint Prediction. Following the design of E2EParking [yang2024e2e], the network also incorporates auxiliary heads for depth estimation and semantic segmentation to improve feature representation. Notably, the Learnable Motion Prediction module is used only during inference, where it leverages historical control and vehicle state information to reason about ego-vehicle dynamics and target position.
  • Figure 4: Failure case study across different baselines and our method. Each row corresponds to a different experiment: (1) reproduced E2EParking trained with limited data, (2) Modular Hybrid A* baseline, (3) original E2EParking checkpoint trained with large-scale data and a longer schedule, (4) CAA-Policy (Ours). The columns show: (a) the distribution of key metrics (TSR, TFR, NTSR, CR, TR) and minor errors (e.g., out-of-bound) as pie charts; (b-d) the top three representative failure modes observed in that experiment. This layout allows direct comparison of overall performance and failure types across baselines and our method. Trajectories are color-coded according to time sequence. For the Modular Hybrid A* baseline, red trajectories indicate the planned GT trajectory, which the controller struggles to follow on consecutive turns.