GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition

Ruijie Yao, Sheng Jin, Lumin Xu, Wang Zeng, Wentao Liu, Chen Qian, Ping Luo, Ji Wu

TL;DR

GKGNet tackles MLIR by unifying image patches and label embeddings in a single graph and enabling dynamic, multi-perspective message passing through Group KGCN. It introduces cross-level and patch-level graphs, with Group KNN and group max-relative convolution to adapt connectivity to object scale and layout, mitigating background interference. Empirically, it achieves state-of-the-art results on MS-COCO and VOC2007 with lower computational costs, and ablations confirm the contributions of Patch-Level, Cross-Level, and Group KNN components. The work demonstrates the practical impact of dynamic graph construction for MLIR and suggests extensions to broader graph-based learning problems.

Abstract

Multi-Label Image Recognition (MLIR) is a challenging task that aims to predict multiple object labels in a single image while modeling the complex relationships between labels and image regions. Although convolutional neural networks and vision transformers have succeeded in processing images as regular grids of pixels or patches, these representations are sub-optimal for capturing irregular and discontinuous regions of interest. In this work, we present the first fully graph convolutional model, Group K-nearest neighbor based Graph convolutional Network (GKGNet), which models the connections between semantic label embeddings and image patches in a flexible and unified graph structure. To address the scale variance of different objects and to capture information from multiple perspectives, we propose the Group KGCN module for dynamic graph construction and message passing. Our experiments demonstrate that GKGNet achieves state-of-the-art performance with significantly lower computational costs on two challenging multi-label datasets, MS-COCO and VOC2007. Code is available at https://github.com/jin-s13/GKGNet.

Paper Structure (35 sections, 3 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Illustration of feature extraction in CNN, vision transformer, and graph convolutional network (GCN). (a) CNN excels at processing continuous regions but struggles with irregular regions of interest. (b) Vision transformers handle complex regions of interest but introduce redundant interference from the background. (c) GCN constructs connections between the destination node and multiple objects of interest distributed in different spatial locations.
  • Figure 2: Overview of GKGNet. GKGNet splits the input image into a set of patch nodes and treats the learnable label embeddings as label nodes. A four-stage network processes the patch nodes and label nodes in the unified graph structure. The number of patch nodes is reduced after each stage to extract multi-scale visual features. At each stage, the patch nodes are first updated via Patch-Level Group KGCN modules, and then Cross-Level Group KGCN modules update the label nodes by building connections between target labels and image regions of interest. The output patch nodes and label nodes of the last stage are combined for multi-label prediction.
  • Figure 3: Illustration of Group KGCN. (a) Traditional KNN based graph construction (K=2). (b) Group KNN based graph construction (G=2, K=2). The blue check marks indicate the source nodes are selected. (c) Structure of Group KGCN module.
  • Figure 4: Effect of the number of groups $G$ (Left) and number of neighbors $K$ (Right).
  • Figure 5: Visualization of the learned connections between a label node and the patch nodes in the Cross-Level Group KGCN module. The colored blocks mark the patches connected to the label "bottle", "cup", or "car".
  • ...and 5 more figures