Multi-modal Crowd Counting via Modal Emulation

Chenhao Wang, Xiaopeng Hong, Zhiheng Ma, Yupeng Wei, Yabin Wang, Xiaopeng Fan

TL;DR

The paper tackles multi-modal crowd counting with a two-pass framework based on modal emulation, which jointly fuses and emulates modalities. The framework consists of a Multi-modal Inference (MMI) pass built on a Hybrid Cross-modal Attention (HCMA) module and a Cross-modal Emulation (CME) pass guided by attention prompting; both are trained with a modality alignment loss in addition to a Bayesian counting loss. The key contributions are the HCMA module for global-local fusion, the attention-prompting CME pass that aligns modalities during training, and a modality alignment loss that bridges the semantic gap between modalities. The method reports state-of-the-art results on RGB-Thermal and RGB-Depth datasets. CME is confined to training, so it adds no overhead at test time, and the approach may extend to other multi-modal perception tasks.

Abstract

Multi-modal crowd counting is a crucial task that uses multi-modal cues to estimate the number of people in crowded scenes. To overcome the gap between different modalities, we propose a modal emulation-based two-pass multi-modal crowd-counting framework that enables efficient modal emulation, alignment, and fusion. The framework consists of two key components: a multi-modal inference pass and a cross-modal emulation pass. The former utilizes a hybrid cross-modal attention module to extract global and local information and achieve efficient multi-modal fusion. The latter uses attention prompting to coordinate different modalities and enhance multi-modal alignment. We also introduce a modality alignment module that uses an efficient modal consistency loss to align the outputs of the two passes and bridge the semantic gap between modalities. Extensive experiments on both RGB-Thermal and RGB-Depth counting datasets demonstrate its superior performance compared to previous methods. Code available at https://github.com/Mr-Monday/Multi-modal-Crowd-Counting-via-Modal-Emulation.
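
To make the training objective concrete, here is a minimal sketch of how a counting loss and a consistency term between the two passes might be combined. It is an illustration under stated assumptions, not the authors' implementation: the function names, the mean-absolute-difference form of the alignment term, and the `align_weight` parameter are all hypothetical, and a plain L1 count loss stands in for the Bayesian counting loss used in the paper.

```python
def count_l1_loss(pred_density, gt_count):
    """Stand-in counting loss: |sum(density) - ground-truth count|.

    The paper trains with a Bayesian counting loss; this simpler L1 count
    loss is only a placeholder for illustration.
    """
    return abs(sum(pred_density) - gt_count)


def alignment_loss(mmi_out, cme_out):
    """Assumed consistency term: mean absolute difference between the
    outputs of the MMI pass and the CME pass."""
    assert len(mmi_out) == len(cme_out)
    return sum(abs(a - b) for a, b in zip(mmi_out, cme_out)) / len(mmi_out)


def total_loss(mmi_out, cme_out, gt_count, align_weight=1.0):
    """Counting loss on the MMI output plus a weighted alignment term.

    align_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    return count_l1_loss(mmi_out, gt_count) + align_weight * alignment_loss(mmi_out, cme_out)


# Toy flattened density maps from the two passes (made-up numbers).
mmi_density = [0.5, 1.0, 0.5, 0.0]
cme_density = [0.5, 0.5, 0.5, 0.5]
loss = total_loss(mmi_density, cme_density, gt_count=3.0, align_weight=0.5)
print(loss)
```

Because the CME pass is used only during training, a term of this kind would affect the learned weights but add no computation at test time.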

Paper Structure (13 sections, 10 equations, 4 figures, 5 tables)

This paper contains 13 sections, 10 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Illustration of the proposed framework. It consists of two passes: the Multi-modal Inference (MMI) pass and the Cross-modal Emulation (CME) pass. The MMI pass uses a hybrid cross-modal attention module to fuse global and local information across modalities. The CME pass shares its structure and weights with the MMI pass but emulates the features of one modality from the other, i.e., $F_r \rightarrow \bar{F_t}$ and $F_t\rightarrow \bar{F_r}$, using an additional attention prompting module. This emulation process fosters coordination between the modalities. In addition, a modality alignment loss is employed to bridge the semantic gap between them.
  • Figure 2: Architecture of the Hybrid Cross-Modal Attention Module. (a) Straight Cross-modal Attention is used for global multi-modal fusion. (b) Modulated Cross-modal Attention is used to fuse local details, where the $\odot$, $\sigma$ and ⓒ denote Hadamard product, Sigmoid function, and Concatenation operation, respectively.
  • Figure 3: Visualization of the crowd density maps generated by different models.
  • Figure 4: Distribution of the relative $L1$ distances between the real and pseudo features.
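
The two attention variants named in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration rather than the paper's module: learned query/key/value projections are omitted, the helper names are hypothetical, and the exact wiring of the sigmoid gate, Hadamard product, and concatenation in `gated_fusion` is an assumption, since the caption only names the operations.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one modality
    and keys/values from the other (no learned projections here)."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(feat_a, feat_b):
    """Assumed modulated fusion: gate feat_a element-wise (Hadamard product)
    with sigmoid(feat_b), then concatenate the result with feat_b."""
    fused = []
    for a, b in zip(feat_a, feat_b):
        gated = [x * sigmoid(y) for x, y in zip(a, b)]
        fused.append(gated + b)
    return fused


# Toy token features: 2 RGB tokens and 3 thermal tokens, each of dimension 2.
rgb = [[1.0, 0.0], [0.0, 1.0]]
thermal = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

global_fused = cross_attention(rgb, thermal, thermal)  # RGB attends to thermal
local_fused = gated_fusion(rgb, global_fused)          # gate, then concatenate
```

In this sketch each RGB token becomes a weighted average of the thermal tokens, and the gated branch doubles the channel dimension through concatenation; a real module would typically follow this with a projection back to the original width.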