Enhanced Parking Perception by Multi-Task Fisheye Cross-view Transformers

Antonyo Musabini, Ivan Novikov, Sana Soula, Christel Leonet, Lihao Wang, Rachid Benmokhtar, Fabian Burger, Thomas Boulay, Xavier Perrotton

TL;DR

This paper addresses the need for holistic parking perception suitable for end-user HMIs: it goes beyond vacant-slot detection to jointly localize parked vehicles and estimate their orientation within a $25\,\mathrm{m} \times 25\,\mathrm{m}$ BEV. It introduces Multi-Task Fisheye Cross View Transformers (MT F-CVT), which fuse four fisheye surround-view images using cross-view attention to produce a BEV feature grid, followed by a segmentation decoder and a Polygon-Yolo-based detection decoder with a corner-visibility flag. Trained on LiDAR-labeled data, the approach achieves a localization error of approximately $20\,\mathrm{cm}$ and an F1 score of around $0.89$ for the larger model. A smaller variant runs at about 16 FPS on the Nvidia Jetson Orin with similar detection results, and the method generalizes to unseen vehicles and camera rigs. The output covers the full parking scene, including parking-slot entry lines and vehicle orientation, which is the information richer HMIs and autonomous parking functions need.

Abstract

Current parking area perception algorithms primarily focus on detecting vacant slots within a limited range, relying on error-prone homographic projection for both labeling and inference. However, recent advancements in Advanced Driver Assistance Systems (ADAS) require interaction with end-users through comprehensive and intelligent Human-Machine Interfaces (HMIs). These interfaces should present a complete perception of the parking area, from the entry lines of vacant slots to the orientation of other parked vehicles. This paper introduces Multi-Task Fisheye Cross View Transformers (MT F-CVT), which leverages features from a four-camera fisheye Surround-view Camera System (SVCS) with multi-head attention to create a detailed Bird's-Eye View (BEV) grid feature map. Features are processed by both a segmentation decoder and a Polygon-Yolo based object detection decoder for parking slots and vehicles. Trained on data labeled using LiDAR, MT F-CVT localizes objects within 25 m x 25 m real open-road scenes with an average error of only 20 cm. Our larger model achieves an F1 score of 0.89. Moreover, the smaller model operates at 16 fps on an Nvidia Jetson Orin embedded board, with detection results similar to the larger one. MT F-CVT demonstrates robust generalization capability across different vehicles and camera rig configurations. A demo video from an unseen vehicle and camera rig is available at: https://streamable.com/jjw54x.
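
To make the idea of fusing four camera views into a BEV grid with attention more concrete, the following is a minimal single-head sketch in plain Python. It is not the authors' implementation: the function names, the toy dimensions, the random inputs, and the absence of learned projections, multiple heads, and fisheye-aware positional embeddings are all simplifying assumptions made for illustration. Each BEV cell holds a query vector that attends over the pooled feature tokens of all four cameras.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_view_attention(bev_queries, camera_tokens):
    """Single-head scaled dot-product cross-attention (illustrative only).

    bev_queries:   list of D-dimensional query vectors, one per BEV cell.
    camera_tokens: one list per camera, each holding D-dimensional tokens.
    Returns (bev_features, attention_weights).
    """
    # Pool the tokens of all cameras into one key/value set.
    tokens = [tok for cam in camera_tokens for tok in cam]
    dim = len(bev_queries[0])
    scale = 1.0 / math.sqrt(dim)

    bev_features, all_weights = [], []
    for query in bev_queries:
        scores = [scale * sum(q * k for q, k in zip(query, tok)) for tok in tokens]
        weights = softmax(scores)
        # Weighted sum of the tokens (values equal keys in this sketch).
        fused = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        bev_features.append(fused)
        all_weights.append(weights)
    return bev_features, all_weights


# Hypothetical toy sizes: a 4 x 4 BEV grid, 4 cameras, 6 tokens per camera.
random.seed(0)
DIM, GRID, TOKENS_PER_CAM, NUM_CAMS = 8, 4, 6, 4
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(GRID * GRID)]
cameras = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(TOKENS_PER_CAM)]
    for _ in range(NUM_CAMS)
]

bev, attn = cross_view_attention(queries, cameras)
print(len(bev), len(bev[0]), len(attn[0]))  # 16 8 24
```

Every BEV cell ends up with one fused feature vector, and its attention weights over the 24 camera tokens sum to one. In a trained model the queries, keys, and values would come from learned projections, so this sketch only shows the data flow.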

Paper Structure (15 sections, 1 equation, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Left: Four fisheye images from the surround view camera system. Right: Vacant parking slots and vehicles.
  • Figure 2: Global architecture. Upper left: the input images, the feature extractor and its output dimensions. Middle left: the fisheye-aware positional embeddings and their down-scaled shapes. Bottom left: the position-aware map embedding, the multi-head cross-view transformers and the bottlenecks. Right: the two independent task heads, which operate on the BEV features, and their respective outputs. Purple circles represent layers.
  • Figure 3: Projection encoders. a) Pinhole cameras, as flat surfaces, with respect to their . b) Fisheye cameras, with respect to the lens's radial distortion (front, left, rear, right).
  • Figure 4: Qualitative results. Inference examples showing the four fisheye images (left), the predictions (middle) and the annotated labels (right). In the labels, the dark gray zones represent the segmentation maps of parking areas (the segmentation map for vehicles is omitted for readability). The polygon labels for both vehicles and parking areas are also shown. The red dots indicate their center positions, the green dots correspond to their heading directions (the entry line for parking areas), and the blue dots mark their rear sides.
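
The dot convention in Figure 4 suggests a simple way to read an orientation and a localization error off a polygon. The sketch below is an assumption-laden illustration, not the paper's evaluation code: the function names, the coordinate frame (BEV metres, x to the right, y up, angles counter-clockwise from the x axis), and the example coordinates are all hypothetical.

```python
import math


def heading_deg(center, heading_point):
    """Angle of the center -> heading-point vector, in degrees within [0, 360).

    `center` plays the role of the red dot and `heading_point` the green dot.
    """
    dx = heading_point[0] - center[0]
    dy = heading_point[1] - center[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0


def center_error_m(predicted_center, labeled_center):
    """Euclidean distance in metres between predicted and labeled centers."""
    return math.dist(predicted_center, labeled_center)


# Hypothetical vehicle: labeled center at (3.0, 4.0) m, heading point straight "up".
label_center = (3.0, 4.0)
label_heading = (3.0, 6.2)
pred_center = (3.12, 4.16)

print(round(heading_deg(label_center, label_heading), 1))
print(round(center_error_m(pred_center, label_center), 2))
```

Under these assumptions the example vehicle points along the positive y axis, and the predicted center lies about 0.2 m from the label, which is the order of magnitude of the average error reported in the abstract.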