Rethinking Global Context in Crowd Counting

Guolei Sun, Yun Liu, Thomas Probst, Danda Pani Paudel, Nikola Popovic, Luc Van Gool

TL;DR

Extensive experiments on various datasets, including ShanghaiTech, UCF-QNRF, JHU-CROWD++ and NWPU, demonstrate that the proposed context extraction techniques can significantly improve the performance over the baselines.

Abstract

This paper investigates the role of global context for crowd counting. Specifically, a pure transformer is used to extract features with global information from overlapping image patches. Inspired by classification, we add a context token to the input sequence to facilitate information exchange with the tokens corresponding to image patches throughout the transformer layers. Because transformers do not explicitly model the tried-and-true channel-wise interactions, we propose a token-attention module (TAM) that recalibrates the encoded features through channel-wise attention informed by the context token. The context token is also used to predict the total person count of the image through a regression-token module (RTM). Extensive experiments on various datasets, including ShanghaiTech, UCF-QNRF, JHU-CROWD++ and NWPU, demonstrate that the proposed context extraction techniques can significantly improve the performance over the baselines.
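
The role of the context token can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single attention layer, the residual connection, the zero-initialized token, the dimensions, and all function names (`self_attention`, `encode_with_context_token`) are assumptions chosen for illustration. It only shows how one extra token, placed in front of the patch tokens, exchanges information with every patch through self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over a (num_tokens, dim) sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def encode_with_context_token(patch_tokens, context_token, w_q, w_k, w_v):
    """Prepend the context token, apply one attention layer, split the result."""
    sequence = np.concatenate([context_token[None, :], patch_tokens], axis=0)
    encoded = sequence + self_attention(sequence, w_q, w_k, w_v)  # residual
    # Row 0 is the context token; the remaining rows are the patch tokens.
    return encoded[0], encoded[1:]

rng = np.random.default_rng(0)
num_patches, dim = 16, 8                      # hypothetical sizes
patch_tokens = rng.normal(size=(num_patches, dim))
context_token = np.zeros(dim)                 # would be a learned parameter
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

context_out, patches_out = encode_with_context_token(
    patch_tokens, context_token, w_q, w_k, w_v
)
print(context_out.shape, patches_out.shape)
```

After the layer, `context_out` is a single vector that has attended to all patches, which is the kind of global summary that TAM and RTM consume according to the abstract.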

Paper Structure

This paper contains 16 sections, 11 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Network Overview. The input image is first split into overlapping patches. These patches then pass through a token-reduction block and the main transformer to learn features with global information. To abstract global information, a context token (blue vector) is added to the input sequence before the main transformer. The encoded features are processed by TAM and the regression-token module (RTM). The small decoder after TAM is not shown for simplicity.
  • Figure 2: Comparison between the SE block (Hu et al., 2018) and TAM. Unlike the SE block, which obtains global information from the input features, TAM uses the context-token feature to provide channel relations.
  • Figure 3: The structure of RTM.
  • Figure 4: Density Map Visualization. We compare the ground-truth density map with the density maps predicted by DM-Count (Wang et al., 2020) and by the proposed method. Our approach produces better density maps for both dense and sparse regions, leading to more accurate count predictions.
  • Figure 5: Failure Cases. These failures are caused by low contrast or low quality in the input images.
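
The contrast drawn in the Figure 2 caption can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the paper's code: the bottleneck MLP with a sigmoid gate is the standard SE design, and reusing the same gate for TAM is an assumption, as are the function names, weights, and sizes. The only difference shown is where the channel descriptor comes from: the features themselves (SE) or the encoded context token (TAM).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_gate(descriptor, w1, w2):
    """Bottleneck MLP (ReLU, then sigmoid) giving one weight in (0, 1) per channel."""
    hidden = np.maximum(descriptor @ w1, 0.0)
    return sigmoid(hidden @ w2)

def se_recalibrate(features, w1, w2):
    """SE-style: the descriptor is the average of the input features."""
    descriptor = features.mean(axis=0)        # squeeze over all tokens
    return features * channel_gate(descriptor, w1, w2)

def tam_recalibrate(features, context_token, w1, w2):
    """TAM-style (assumed form): the descriptor is the context token."""
    return features * channel_gate(context_token, w1, w2)

rng = np.random.default_rng(0)
num_tokens, channels, reduced = 16, 8, 2      # hypothetical sizes
features = rng.normal(size=(num_tokens, channels))
context_token = rng.normal(size=channels)     # stands in for the encoded token
w1 = rng.normal(size=(channels, reduced))
w2 = rng.normal(size=(reduced, channels))

out_se = se_recalibrate(features, w1, w2)
out_tam = tam_recalibrate(features, context_token, w1, w2)
print(out_se.shape, out_tam.shape)
```

In both variants every token is scaled by the same per-channel weights, so the output keeps the shape of `features`; only the source of the global information differs.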