Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization

Qi Zhang, Kaiyi Zhang, Antoni B. Chan, Hui Huang

TL;DR

A novel Mahalanobis distance-based multi-view optimal transport (M-MVOT) loss designed specifically for multi-view crowd localization. It replaces the Euclidean-based transport cost with the Mahalanobis distance, which defines elliptical iso-contours in the cost function whose long-axis and short-axis directions are guided by the view ray direction.

Abstract

Multi-view crowd localization predicts the ground locations of all people in the scene. Typical methods first estimate crowd density maps on the ground plane and then obtain the crowd locations from them. However, their performance is limited by the ambiguity of density maps in crowded areas, where local peaks can be smoothed away. To mitigate this weakness of density map supervision, optimal transport-based point supervision methods have been proposed for single-image crowd localization, but they have not yet been explored for multi-view crowd localization. In this paper, we therefore propose a novel Mahalanobis distance-based multi-view optimal transport (M-MVOT) loss specifically designed for multi-view crowd localization. First, we replace the Euclidean-based transport cost with the Mahalanobis distance, which defines elliptical iso-contours in the cost function whose long-axis and short-axis directions are guided by the view ray direction. Second, the object-to-camera distance in each view is used to further adjust the optimal transport cost of each location, so that wrong predictions far from the camera are penalized more heavily. Finally, we propose a strategy to consider all input camera views in the model loss (M-MVOT) by computing the optimal transport cost for each ground-truth point based on its closest camera. Experiments on several multi-view crowd localization datasets demonstrate the advantage of the proposed method over density map-based and common Euclidean distance-based optimal transport losses. Project page: https://vcc.tech/research/2024/MVOT.
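
As a rough illustration of the transport cost described in the abstract, one possible parameterization is sketched here. All symbols ($\mathbf{x}_i$ for a predicted ground location, $\mathbf{g}_j$ for a ground-truth point, $\mathbf{c}_j$ for the ground position of its closest camera, the scales $\sigma_\parallel$ and $\sigma_\perp$, and the weight $w$) are assumptions introduced for this sketch. The abstract does not state the exact form, which axis is the long one, or how the distance enters the cost.

$$
d_j = \lVert \mathbf{g}_j - \mathbf{c}_j \rVert, \qquad
\mathbf{u}_j = \frac{\mathbf{g}_j - \mathbf{c}_j}{d_j}, \qquad
\mathbf{R}_j = \begin{bmatrix} \mathbf{u}_j & \mathbf{u}_j^{\perp} \end{bmatrix}
$$

$$
\boldsymbol{\Sigma}_j = \mathbf{R}_j \, \mathrm{diag}\!\left(\sigma_\parallel^2,\ \sigma_\perp^2\right) \mathbf{R}_j^{\top}, \qquad
\mathbf{C}_{ij} = w(d_j)\, (\mathbf{x}_i - \mathbf{g}_j)^{\top} \boldsymbol{\Sigma}_j^{-1} (\mathbf{x}_i - \mathbf{g}_j)
$$

Under these assumptions, the iso-contours of $\mathbf{C}_{ij}$ around $\mathbf{g}_j$ are ellipses whose axes are aligned with the view ray $\mathbf{u}_j$ and its perpendicular $\mathbf{u}_j^{\perp}$, and a weight $w$ that increases with $d_j$ penalizes errors far from the camera more heavily. Setting $\sigma_\parallel = \sigma_\perp = 1$ and $w \equiv 1$ recovers the squared Euclidean cost.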

Paper Structure (13 sections, 7 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Comparison of different optimal transport (OT) losses on the ground plane. The original OT uses a Euclidean distance cost (E-OT) and treats all deviations from the ground truth equally. MV-OT uses the view ray direction to the camera to change the cost function, and ED-OT adds the camera distance to E-OT, while M-OT considers both the view ray direction and the distance to the camera.
  • Figure 2: The model architecture and the proposed Mahalanobis distance-based multi-view optimal transport (M-MVOT) loss for multi-view crowd localization. The model consists of feature extraction, projection, and multi-view fusion and decoding. In the proposed M-MVOT, each point's transport cost $\mathbf{C}$ is calculated with respect to its closest camera using the Mahalanobis distance, which is directed by the view ray and adjusted by the object-to-camera distance, instead of the common Euclidean distance.
  • Figure 3: Comparison of different versions of the multi-view optimal transport, where the cost for each ground-truth point is calculated using the closest camera. E-MVOT treats each point equally regardless of the camera views, which is the same as E-OT for a single camera view. MV-MVOT replaces the Euclidean distance with the Mahalanobis distance, with its direction guided by the view ray. ED-MVOT introduces the object-to-camera distance into the optimal transport cost of E-OT. M-MVOT, based on M-OT, combines the view-ray direction guidance with the object-to-camera distance adjustment.
  • Figure 4: The predicted crowd occupancy maps of different methods on the three datasets CVCS, MultiviewX, and Wildtrack (zoom in for a better view). Each value in a crowd occupancy map indicates the probability that a person is present at that location.
  • Figure 5: Comparison of MV-MVOT, ED-MVOT, and M-MVOT. The blue triangle is the camera location. M-MVOT predicts fewer artifacts than MV-MVOT and ED-MVOT (see red boxes), demonstrating the effectiveness of distance adjustment and view-ray direction, respectively.
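
The four variants compared in the figure captions differ only in whether the cost uses the view-ray direction and whether it uses the object-to-camera distance. The following NumPy sketch illustrates that relationship under stated assumptions. The function name `transport_cost`, the parameters `sigma_ray`, `sigma_perp`, and `alpha`, the choice to elongate the ellipse along the view ray, and the linear distance weight `1 + alpha * d` are all hypothetical and not taken from the paper, whose actual formulation may differ.

```python
import numpy as np


def transport_cost(pred, gt, cams, sigma_ray=2.0, sigma_perp=1.0,
                   alpha=0.1, use_view_ray=True, use_distance=True):
    """Return a (num_pred, num_gt) cost matrix on the ground plane.

    pred: (P, 2) predicted ground locations
    gt:   (G, 2) ground-truth ground locations
    cams: (K, 2) camera positions projected onto the ground plane
    All parameter names and default values are illustrative assumptions.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    cams = np.asarray(cams, dtype=float)

    # Pick the closest camera for every ground-truth point.
    cam_dists = np.linalg.norm(gt[:, None, :] - cams[None, :, :], axis=-1)
    nearest = cam_dists.argmin(axis=1)                    # (G,)
    d = cam_dists[np.arange(len(gt)), nearest]            # (G,)

    # Unit view-ray direction u (camera -> point) and its perpendicular v.
    eps = 1e-8
    ray = gt - cams[nearest]
    u = np.where(d[:, None] > eps,
                 ray / np.maximum(d, eps)[:, None],
                 np.array([1.0, 0.0]))                    # fallback if point sits on camera
    v = np.stack([-u[:, 1], u[:, 0]], axis=1)

    # Decompose each prediction error into along-ray and across-ray parts.
    diff = pred[:, None, :] - gt[None, :, :]              # (P, G, 2)
    along = np.einsum("pgk,gk->pg", diff, u)
    across = np.einsum("pgk,gk->pg", diff, v)

    if use_view_ray:
        # Squared Mahalanobis distance with axes aligned to the view ray.
        cost = (along / sigma_ray) ** 2 + (across / sigma_perp) ** 2
    else:
        # Plain squared Euclidean distance.
        cost = along ** 2 + across ** 2

    if use_distance:
        # Assumed weighting: errors farther from the camera cost more.
        cost = cost * (1.0 + alpha * d)[None, :]
    return cost
```

With these flags, `use_view_ray=False, use_distance=False` corresponds to the E-MVOT setting, `use_view_ray=True` alone to MV-MVOT, `use_distance=True` alone to ED-MVOT, and both together to M-MVOT. The resulting matrix could then be passed to any optimal transport solver in place of a Euclidean cost.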