Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks

Morui Zhu, Yongqi Zhu, Song Fu, Qing Yang

Abstract

Autonomous trucking poses unique challenges due to articulated tractor-trailer geometry and time-varying sensor poses caused by the fifth-wheel joint and trailer flex. Existing perception and calibration methods assume static baselines or rely on high-parallax, texture-rich scenes, limiting their reliability in real-world settings. We propose dCAP (dynamic Calibration and Articulated Perception), a vision-based framework that continuously estimates the 6-DoF (six degrees of freedom) relative pose between tractor and trailer cameras. dCAP employs a transformer with cross-view and temporal attention to robustly aggregate spatial cues while maintaining temporal consistency, enabling accurate perception under rapid articulation and occlusion. Integrated with BEVFormer, dCAP improves 3D object detection by replacing static calibration with dynamically predicted extrinsics. To facilitate evaluation, we introduce STT4AT, a CARLA-based benchmark simulating semi-trailer trucks with synchronized multi-sensor suites and time-varying inter-rig geometry across diverse environments. Experiments demonstrate that dCAP achieves stable, accurate perception while addressing the limitations of static calibration in autonomous trucking. The dataset, development kit, and source code will be publicly released.
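
To make the idea of replacing static calibration with dynamically predicted extrinsics concrete, the following is a minimal sketch, not the authors' implementation. It assumes a 6-DoF pose given as a translation plus roll, pitch, and yaw, composed as R = Rz(yaw) Ry(pitch) Rx(roll); this convention, the function names `pose_to_matrix` and `transform_point`, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def pose_to_matrix(tx, ty, tz, roll, pitch, yaw):
    """Build a 4x4 homogeneous transform from a 6-DoF pose.

    Assumed convention (not taken from the paper):
    R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians, translation in metres.
    """
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr, tx],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr, ty],
        [-sp, cp * sr, cp * cr, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(T, p):
    """Map a 3D point p from the source frame into the target frame of T."""
    x, y, z = p
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3))


# Hypothetical numbers: the trailer camera sits 8 m behind the tractor frame.
# A static calibration assumes the trailer is perfectly aligned (yaw = 0).
static_T = pose_to_matrix(-8.0, 0.0, 0.0, 0.0, 0.0, 0.0)

# A dynamically predicted extrinsic for a frame where the trailer is yawed by 20 degrees.
dynamic_T = pose_to_matrix(-8.0, 0.0, 0.0, 0.0, 0.0, math.radians(20.0))

# A point observed 10 m along the x-axis of the trailer-camera frame.
point = (10.0, 0.0, 0.0)
p_static = transform_point(static_T, point)
p_dynamic = transform_point(dynamic_T, point)

# Placement error in the tractor frame caused by using the stale static extrinsic.
error = math.dist(p_static, p_dynamic)
print(f"static: {p_static}, dynamic: {p_dynamic}, error: {error:.2f} m")
```

Under these made-up values, the same observation lands roughly 3.5 m apart depending on which extrinsic is used, which illustrates why a fixed tractor-to-trailer calibration can misplace objects during articulation.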

Paper Structure (21 sections, 12 equations, 10 figures, 12 tables)

Figures (10)

  • Figure 1: Overview of dCAP. We perform online 6-DoF articulated pose estimation for tractor–trailer systems and enable articulation-aware perception. Unlike traditional SfM methods (e.g., COLMAP), dCAP succeeds without requiring a valid static initialization pair.
  • Figure 2: Example from the STT4AT dataset showing six synchronized camera views, the LiDAR point cloud with annotated agents, and a BEV illustration of trailer articulation.
  • Figure 3: Distribution of annotated frames in the STT4AT dataset. The left chart separates straight and turning maneuvers, while the right details the composition within turning scenarios.
  • Figure 4: Overview of the proposed architecture. Multi-view images at time $t$ are encoded by a frozen VGGT backbone into camera tokens, while the trainable decoder comprises (a) Camera Temporal Self-Attention (CTA) for fusing the historical token $T_{t-1}$ with the current query $Q$, (b) Camera Cross-Attention (CCA) for attending $Q$ to encoder tokens $\{T_i\}_{i=1}^{6}$, and (c) an AdaLN-modulated refinement stack with residual Add&Norm applied $L$ times.
  • Figure 5: Representative articulated-driving scenarios in the STT4AT dataset. Each example shows the tractor–trailer configuration and trailer-mounted camera poses under typical high-articulation maneuvers.
  • ...and 5 more figures