Gramformer: Learning Crowd Counting via Graph-Modulated Transformer

Hui Lin, Zhiheng Ma, Xiaopeng Hong, Qinnan Shangguan, Deyu Meng

TL;DR

Gramformer addresses the homogenized attention problem in transformer-based crowd counting by introducing two graph-based modulations: an attention graph, built via Edge Weight Regression (EWR), that diversifies attention in an anti-similarity manner, and a feature-based centrality encoding graph that injects node centrality information into the input features. The method jointly modulates both attention and node features, with a static attention graph and a dynamic centrality embedding bank that adapt per layer. Empirical results on four large crowd datasets show that Gramformer is competitive with the state of the art, particularly on dense scenes, and ablations confirm the contributions of EWR, centrality encoding, and edge regularization. This graph-modulated transformer framework offers a practical pathway to enhance vision transformers for tasks with highly similar patches and structured scene geometry, with potential applicability beyond crowd counting.

Abstract

Transformers have been popular in recent crowd counting work because they overcome the limited receptive field of traditional CNNs. However, since crowd images always contain a large number of similar patches, the self-attention mechanism in a Transformer tends to find a homogenized solution where the attention maps of almost all patches are identical. In this paper, we address this problem by proposing Gramformer: a graph-modulated transformer that enhances the network by adjusting the attention and the input node features, respectively, on the basis of two different types of graphs. First, an attention graph is proposed to diversify attention maps so that they attend to complementary information. The graph is built upon the dissimilarities between patches, modulating the attention in an anti-similarity fashion. Second, a feature-based centrality encoding is proposed to discover the centrality positions or importance of nodes. We encode them with a proposed centrality indices scheme to modulate the node features and similarity relationships. Extensive experiments on four challenging crowd counting datasets have validated the competitiveness of the proposed method. Code is available at https://github.com/LoraLinH/Gramformer.
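
The anti-similarity idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-node scalar values (`node_values`), the absolute-difference edge weight, and the choice to add the edge weight to the attention logits before the softmax are all assumptions made for illustration. It only shows how nearly identical patches yield nearly identical attention rows, and how a dissimilarity-based bias can pull those rows apart.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_rows(queries, keys, edge_bias=None):
    """Scaled dot-product attention weights, optionally with an additive edge bias."""
    scale = math.sqrt(len(queries[0]))
    rows = []
    for i, q in enumerate(queries):
        logits = [dot(q, k) / scale for k in keys]
        if edge_bias is not None:
            # Assumption: edge weights act as an additive bias on the logits.
            logits = [l + edge_bias[i][j] for j, l in enumerate(logits)]
        rows.append(softmax(logits))
    return rows

def anti_similarity_edges(node_values):
    """Hypothetical edge weights: larger when two nodes' scalar values differ more."""
    return [[abs(a - b) for b in node_values] for a in node_values]

# Four nearly identical patch features, as in a dense crowd region.
patches = [[1.0, 0.0], [1.0, 0.01], [0.99, 0.0], [1.0, 0.02]]
# Stand-in for per-node values that a learned regressor might predict.
node_values = [0.0, 1.0, 2.0, 3.0]

vanilla = attention_rows(patches, patches)
modulated = attention_rows(patches, patches, anti_similarity_edges(node_values))
```

In this toy setting the rows of `vanilla` are almost indistinguishable, while each row of `modulated` puts most of its weight on the node whose value differs most from its own.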

Paper Structure (22 sections, 11 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Visual comparison of attention maps for different patches between vanilla attention and the proposed graph-guided attention. Each white cross marks the location of the patch to which the attention map corresponds. The vanilla attention finds a homogenized solution where the attention maps of most patches are similar (to the final density map), regardless of whether they are background (the first column) or foreground (the last two columns).
  • Figure 2: The framework of Gramformer, which contains two main parts: an attention graph to modulate the attention mechanism by its edge weight, and a feature-based centrality encoding graph to encode the centrality or importance of each node. In the attention graph, different colors represent different semantic values predicted by EWR, and the color difference corresponds to the strength of connecting edges. In centrality encoding, each node is assigned a centrality index, which is linked to its in-degree. The centrality index is used to find the corresponding embedding from a learnable bank to modulate node features.
  • Figure 3: The overall structure of the graph transformer baseline. Edges are constructed according to nearest-neighbor similarities; each edge weight, obtained by encoding the features of its two endpoint nodes, is added to the attention before the softmax.
  • Figure 4: Visualizations of the selected nearest neighbors for nodes in the neighboring graph. The red box represents the target node, while the other yellow boxes represent the nodes whose features are most similar to the target feature. The first row represents the neighboring relations in the initial state, while the second row represents the neighboring relations after the transformer update.
  • Figure 5: Visualizations on UCF-QNRF. The second row contains density maps predicted by the vanilla transformer while the third row contains density maps predicted by our proposed Gramformer. The vanilla transformer predicts blurred densities (the first example) in the crowd with smaller scales, while our method still generates clear density points. For the last two examples, our model avoids false alarms in the background predictions.
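
The centrality encoding described in the Figure 2 caption (a centrality index linked to in-degree, used to look up an embedding from a bank) can be sketched as follows. This is a rough illustration under stated assumptions rather than the paper's code: the nearest-neighbor graph built from squared Euclidean distance, the clipping of the in-degree to the bank size, the additive modulation, and the fixed `bank` values (which would be learnable in a real model) are all hypothetical choices.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_in_degrees(features, k=1):
    """Count, for every node, how many other nodes pick it among their k nearest neighbors."""
    n = len(features)
    in_degree = [0] * n
    for i in range(n):
        # Rank all other nodes by distance to node i, nearest first.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: squared_distance(features[i], features[j]),
        )
        for j in others[:k]:
            in_degree[j] += 1
    return in_degree

def centrality_encode(features, bank, k=1):
    """Add the bank embedding selected by each node's (clipped) in-degree to its feature."""
    in_degree = knn_in_degrees(features, k)
    # Assumption: the centrality index is the in-degree, clipped to the last bank slot.
    indices = [min(d, len(bank) - 1) for d in in_degree]
    encoded = [
        [x + e for x, e in zip(feature, bank[idx])]
        for feature, idx in zip(features, indices)
    ]
    return encoded, indices

# Three clustered nodes and one outlier; the cluster center attracts the most neighbors.
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0]]
# Hypothetical embedding bank with one row per centrality index (fixed here, learnable in practice).
bank = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]

encoded, indices = centrality_encode(features, bank, k=1)
```

Here the first node is chosen as nearest neighbor by two others and so receives the highest centrality index, while the outlier is chosen by none and keeps its feature unchanged under this particular bank.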