Multi-modal Crowd Counting via a Broker Modality

Haoliang Meng, Xiaopeng Hong, Chenhao Wang, Miao Shang, Wangmeng Zuo

TL;DR

The paper tackles RGB-thermal crowd counting by bridging the modality gap with an auxiliary broker modality, which reframes the task as a triple-modal learning problem. It introduces a lightweight Broker Modality Generator (BMG) that distills the fusion capability of diffusion-based fusion models through two-stage distillation-then-finetuning training, producing a broker image F = g(R, T) from the RGB image R and the thermal image T. Combined with a shared feature extractor and a regression head, the method reports state-of-the-art results on RGB-T and RGB-D datasets while adding only about 4M parameters. The paper also analyzes and mitigates the ghosting effect that arises when misaligned cross-modal images are fused directly. Code and models are released publicly.
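To make the triple-modal data flow concrete, here is a minimal NumPy sketch of the pipeline shape described in this summary: a broker image F = g(R, T), one shared extractor applied to all three modalities, and a regression head whose density map is summed into a count. Everything in it is an assumption for illustration. The fixed blend in `broker_generator` merely stands in for the learned BMG, the 3x3 filter stands in for a real backbone, and RGB is reduced to grayscale for brevity. None of the function names come from the released code.

```python
import numpy as np

def broker_generator(rgb, thermal, alpha=0.5):
    """Placeholder for g(R, T): a fixed blend, NOT the learned BMG."""
    gray = rgb.mean(axis=-1)                      # (H, W, 3) -> (H, W)
    return alpha * gray + (1.0 - alpha) * thermal

def shared_extractor(image, kernel):
    """One set of weights reused for every modality (here a single 3x3 filter)."""
    h, w = image.shape
    padded = np.pad(image, 1, mode="edge")        # keep the output at (H, W)
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return np.maximum(out, 0.0)                   # ReLU

def regression_head(features, weights):
    """Mix the per-modality feature maps into a non-negative density map."""
    stacked = np.stack(features, axis=0)          # (3, H, W)
    density = np.tensordot(weights, stacked, axes=1)  # (H, W)
    return np.maximum(density, 0.0)

def count_crowd(rgb, thermal, kernel, head_weights):
    """Triple-modal forward pass: R, T and the broker F share one extractor."""
    broker = broker_generator(rgb, thermal)
    modalities = (rgb.mean(axis=-1), thermal, broker)
    features = [shared_extractor(m, kernel) for m in modalities]
    density = regression_head(features, head_weights)
    return density, float(density.sum())          # count = integral of density

# Hypothetical inputs: random 8x8 images and untrained weights.
rng = np.random.default_rng(0)
rgb = rng.random((8, 8, 3))
thermal = rng.random((8, 8))
kernel = np.full((3, 3), 1.0 / 9.0)
head_weights = np.array([1.0, 1.0, 1.0]) / 3.0
density, count = count_crowd(rgb, thermal, kernel, head_weights)
```

With untrained weights the count is meaningless. The sketch only shows that the broker image enters the network as a third input alongside R and T, rather than replacing them.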

Abstract

Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. This task is challenging due to the significant gap between these distinct modalities. In this paper, we propose a novel approach by introducing an auxiliary broker modality and on this basis frame the task as a triple-modal learning problem. We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models. Additionally, we identify and address the ghosting effect caused by direct cross-modal image fusion in multi-modal crowd counting. Through extensive experimental evaluations on popular multi-modal crowd-counting datasets, we demonstrate the effectiveness of our method, which introduces only 4 million additional parameters, yet achieves promising results. The code is available at https://github.com/HenryCilence/Broker-Modality-Crowd-Counting.
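The ghosting effect mentioned above can be illustrated with a toy example. The sketch below is an assumption-laden illustration rather than the paper's analysis: it models one person as a Gaussian blob, blends an "RGB-like" and a "thermal-like" image by plain averaging, and counts intensity peaks along the middle row. The 10-pixel offset, the blob width, and the peak threshold are hypothetical values chosen only to make the effect visible.

```python
import numpy as np

def blob(height, width, cy, cx, sigma=1.5):
    """A single Gaussian blob standing in for one person."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def count_peaks(image, threshold=0.3):
    """Count strict local maxima above `threshold` along the middle row."""
    row = image[image.shape[0] // 2]
    return sum(
        1
        for i in range(1, len(row) - 1)
        if row[i] > threshold and row[i] > row[i - 1] and row[i] > row[i + 1]
    )

rgb_like = blob(21, 41, cy=10, cx=15)
thermal_aligned = blob(21, 41, cy=10, cx=15)
thermal_shifted = blob(21, 41, cy=10, cx=25)   # hypothetical 10-pixel misalignment

# Direct fusion by averaging, with and without misalignment.
fused_aligned = 0.5 * (rgb_like + thermal_aligned)
fused_shifted = 0.5 * (rgb_like + thermal_shifted)

peaks_aligned = count_peaks(fused_aligned)
peaks_shifted = count_peaks(fused_shifted)
print(peaks_aligned, peaks_shifted)
```

In this toy setup the aligned blend keeps a single peak, while the misaligned blend shows two weaker peaks for the same person. A duplicated response of this kind is what can mislead a counting model when cross-modal images are fused directly.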

Paper Structure (20 sections, 8 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Illustration of the framework of our method. By leveraging the Broker Modality Generator (BMG) to introduce an auxiliary broker modality, we frame the dual-modal visual-thermal crowd counting task as a triple-modal learning problem.
  • Figure 1: Registration results of natural scenery images by the OpenCV ORB approach [rublee2011orb] (a) and CAO-C2F [jiang2020contour] (b). Feature points are marked with dots, and matched feature points are connected by line segments. For clarity, only the matched pairs with the top 1% matching confidence are shown. Registration algorithms match feature points accurately on natural scenery images.
  • Figure 2: The framework of our Broker Modality Generator (BMG). Specifically, the generator consists of a cross-modal contracting module and a modal fused decoding module for image reconstruction, and a cross-modal attention module to enhance modality alignment.
  • Figure 2: Registration results of crowd images from the testing set of RGBT-CC [liu2021cross] by the OpenCV ORB approach [rublee2011orb] (a) and CAO-C2F [jiang2020contour] (b). Feature points are marked with dots, and matched feature points are connected by line segments. For clarity, only the matched pairs with the top 1% matching confidence are shown. Registration algorithms perform inadequately on crowd images.
  • Figure 3: Illustration of the distillation-then-finetuning strategy. In the distillation stage, BMG is initialized using a distillation process guided by DDFM. In the fine-tuning stage, BMG is tuned together with the feature extractor to suit the counting task.
  • ...and 4 more figures