Virtual Classification: Modulating Domain-Specific Knowledge for Multidomain Crowd Counting

Mingyue Guo; Binghui Chen; Zhaoyi Yan; Yaowei Wang; Qixiang Ye

TL;DR

This work tackles domain bias in multidomain crowd counting by introducing MDKNet, which uses Instance-specific Batch Normalization (IsBN) guided by a domain-guided BN parameterizer and a Domain-guided Virtual Classifier (DVC). The DVC learns a domain-separable latent space that guides IsBN in adapting feature propagation to each domain, while virtual classification labels capture the overlaps between datasets, enabling dynamic, domain-aware modulation in a single training stage. Experiments on ShanghaiTech A/B, UCF-QNRF, and NWPU show that MDKNet$_{vcl}$ outperforms both single-domain baselines and previous multidomain methods. The pipeline is simple, which makes it practical for scaling crowd counting across multiple domains, and the authors provide code for reproducibility.

Abstract

Multidomain crowd counting aims to learn a general model for multiple diverse datasets. However, deep networks prefer modeling distributions of the dominant domains instead of all domains, which is known as domain bias. In this study, we propose a simple-yet-effective Modulating Domain-specific Knowledge Network (MDKNet) to handle the domain bias issue in multidomain crowd counting. MDKNet is built on the idea of "modulating", which enables a deep network to balance and model the different distributions of diverse datasets with little bias. Specifically, we propose an Instance-specific Batch Normalization (IsBN) module, which serves as a base modulator to refine the information flow so that it adapts to domain distributions. To precisely modulate the domain-specific information, the Domain-guided Virtual Classifier (DVC) is then introduced to learn a domain-separable latent space. This space is employed as input guidance for the IsBN modulator, such that the mixture distributions of multiple datasets can be well treated. Extensive experiments performed on popular benchmarks, including ShanghaiTech A/B, QNRF and NWPU, validate the superiority of MDKNet in tackling multidomain crowd counting and its effectiveness for multidomain learning. Code is available at https://github.com/csguomy/MDKNet.
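To make the idea of guided modulation concrete, here is a minimal sketch of a batch normalization layer whose scale and shift are predicted per instance from a guidance vector. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the parameterizer maps guidance to BN parameters, so the single linear map, the function names (`batch_normalize`, `linear`, `instance_specific_bn`), and the use of flattened `(batch, channels)` features are all hypothetical.

```python
import math

def batch_normalize(features, eps=1e-5):
    """Normalize each channel with batch statistics.
    features: list of per-sample channel vectors, shape (batch, channels)."""
    batch, channels = len(features), len(features[0])
    means = [sum(f[c] for f in features) / batch for c in range(channels)]
    variances = [sum((f[c] - means[c]) ** 2 for f in features) / batch
                 for c in range(channels)]
    return [[(f[c] - means[c]) / math.sqrt(variances[c] + eps)
             for c in range(channels)] for f in features]

def linear(weights, bias, z):
    """Plain affine map: weights has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, z)) + b
            for row, b in zip(weights, bias)]

def instance_specific_bn(features, guidance, w_gamma, b_gamma, w_beta, b_beta):
    """Assumed form: each sample gets its own (gamma, beta), predicted
    from that sample's guidance vector by a hypothetical linear parameterizer."""
    normalized = batch_normalize(features)
    out = []
    for x_hat, z in zip(normalized, guidance):
        gamma = linear(w_gamma, b_gamma, z)  # per-instance scale
        beta = linear(w_beta, b_beta, z)     # per-instance shift
        out.append([g * x + b for g, x, b in zip(gamma, x_hat, beta)])
    return out

# Two samples, two channels; the guidance vectors stand in for points
# in the domain-separable latent space.
features = [[1.0, 2.0], [3.0, 6.0]]
guidance = [[1.0, 0.0], [0.0, 1.0]]
w_gamma = [[2.0, 0.5], [2.0, 0.5]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
modulated = instance_specific_bn(features, guidance, w_gamma, [0.0, 0.0],
                                 zeros, [0.0, 0.0])
```

In this toy setup the two samples share the same normalization statistics but receive different scales (2 and 0.5) because their guidance vectors differ, which is the behavior the abstract attributes to the modulator.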

Paper Structure (37 sections, 7 equations, 6 figures, 9 tables, 2 algorithms)

Introduction
Related Work
Single-domain Crowd Counting
Cross-domain Crowd Counting
Multidomain Learning
Methodology
Motivation
The Proposed Framework
Instance-specific Batch Normalization
Domain-guided BN Parameterizer
Density Map Predictor
Baseline Training
Domain Guidance
Ground-truth Classification Label
Training MDKNet$_{gcl}$
...and 22 more sections

Figures (6)

Figure 1: Sample images for crowd counting from the ShanghaiTech [zhang2016single], UCF-QNRF [idrees2018composition], and NWPU [wang2020nwpu] datasets. Different datasets have different attributes: e.g., ShanghaiTech A is mainly composed of congested images, QNRF consists of highly congested samples and has more background scenarios, NWPU covers a much larger variety of data distributions in terms of density, perspective, background, etc., while ShanghaiTech B favors low-density, ordinary street scenes. As a result, a deep model trained on a single dataset cannot generalize well to other unseen datasets because of these distribution differences.
Figure 2: Illustration of domain-guided modulation. A, B, Q, and N abbreviate the public datasets ShanghaiTech A [zhang2016single], ShanghaiTech B [zhang2016single], UCF-QNRF [idrees2018composition], and NWPU [wang2020nwpu], respectively. Domain-guided virtual classes are generalized ground-truth classes that account for the overlapped domains between datasets. Using these virtual classes, DVC optimizes a domain-separable latent space, which provides guidance for modulating the feature maps in the subsequent Instance-specific Batch Normalization (IsBN). Each dataset is dynamically split into several sub-domains, including a non-overlapped domain and multiple overlapped domains. The overlapped domain between dataset $X$ and dataset $Y$ is denoted as $\mathcal{D}^{X\leftrightarrow Y}$, while the non-overlapped domain of dataset $X$ is represented as $\mathcal{D}^{X \leftrightarrow X}$. $\left\{X\rightarrow Y\right\}$ denotes the set of samples that were originally collected in dataset $X$ yet fall in domain $\mathcal{D}^{X \leftrightarrow Y}$. Consequently, samples that fall in $\mathcal{D}^{X\leftrightarrow Y}$ come from two sets, $\left\{X\rightarrow Y\right\}$ and $\left\{Y\rightarrow X\right\}$.
Figure 3: MDKNet architecture, which consists of a truncated HRNet-W40-C backbone [wang2020deep], an IsBN module, a domain-guided BN parameterizer, and a density map predictor. The domain classifier encourages the output of the domain-guided BN parameterizer to be domain-separable via a classification loss on ground-truth classification labels or virtual classification labels. The virtual classification labels are adaptively generated online via Alg. \ref{alg:tlg}.
Figure 4: Illustration of corrected classification label generation. Without loss of generality, the figure depicts only one image $I$ from dataset $D^{1}$ (we write $I$ rather than $I_i$ to simplify the notation). The one-hot ground-truth label $\overline{y}$ is first transformed into the initial virtual classification label $v_{0}$. In parallel, the encoded feature $\phi(x)$ is passed into a domain classifier and activated via a sigmoid function to obtain the predicted virtual label $\hat{v}$. Then $\hat{v}$ is updated via Alg. \ref{alg:rpl} to obtain $\hat{v}^{*}$. The corrected prediction label $\hat{v}^{*}$ is accumulated over the epochs of window $win_k$. When the last epoch of window $win_k$ is finished, the accumulated label $\sum_{p\in win_k}\hat{v}^{*}_{p}$ is averaged and then combined with the initial virtual classification label $v_{0}$. Finally, the result is passed through a softmax operation to obtain the $(k+1)$-th generated virtual classification label $v_{k+1}$. Best viewed in color.
Figure 5: Density maps of test samples predicted by MDKNet trained on the multidomain dataset. (a) Input images; (b), (c), and (d) density maps predicted by MDKNet$_{base}$, MDKNet$_{gcl}$, and MDKNet$_{vcl}$, respectively; (e) ground-truth density maps. The corresponding ground-truth (GT) and predicted (Est) counts are shown at the bottom-right corner of each image. Predicted and ground-truth classification labels are also provided. Any element of a predicted classification label that is lower than $0.005$ is displayed as $0.00$. MDKNet$_{vcl}$ achieves the best performance both qualitatively and quantitatively.
...and 1 more figures
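
The windowed label update described in the Figure 4 caption can be sketched as follows. This is a rough illustration only: the caption does not say how the averaged prediction is "combined" with $v_{0}$, so element-wise addition is assumed here, and the correction step of Alg. \ref{alg:rpl} is not reproduced (the corrected predictions $\hat{v}^{*}$ are taken as given inputs). The function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def next_virtual_label(v0, corrected_preds):
    """Compute v_{k+1} from the initial label and one window of predictions.

    v0: initial virtual classification label (list of floats).
    corrected_preds: one corrected prediction (v_hat_star) per epoch
        in the current window, each the same length as v0.
    """
    n = len(corrected_preds)
    # Average the corrected predictions accumulated over the window.
    mean_pred = [sum(p[j] for p in corrected_preds) / n
                 for j in range(len(v0))]
    # ASSUMPTION: "combined with v0" is modeled as element-wise addition.
    combined = [a + b for a, b in zip(v0, mean_pred)]
    return softmax(combined)

# One image from the first of four datasets, with a two-epoch window.
v0 = [1.0, 0.0, 0.0, 0.0]
window = [[0.9, 0.3, 0.1, 0.0],
          [0.7, 0.5, 0.1, 0.0]]
v_next = next_virtual_label(v0, window)
```

Under this assumed combination rule, the new label keeps most of its mass on the source dataset while assigning some weight to other datasets that the classifier repeatedly predicted within the window.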
