Effective Clustering on Large Attributed Bipartite Graphs

Renchi Yang, Yidu Wu, Xiaoyang Lin, Qichen Wang, Tsz Nam Chan, Jieming Shi

TL;DR

The paper tackles k-ABGC on large Attributed Bipartite Graphs by introducing Multi-Scale Attribute Affinity (MSA) to capture attribute and topological similarities across multi-hop neighborhoods, and a Three-Phase Optimization (TPO) framework to compute clusters efficiently. TPO avoids quadratic affinity computations by using Random Features to approximate MSA, performs a Greedy Orthogonal Non-Negative Matrix Factorization to factorize a surrogate matrix, and then generates a discrete Normalized Cluster Indicator (NCI) matrix for final clustering. An SVD-based dimension reduction of attributes further compresses input space, preserving MSA while boosting speed and robustness. Empirical results on five real ABGs against 19 baselines show that TPO consistently achieves superior clustering quality and is often more than 40x faster, enabling scalable k-ABGC on graphs with millions of nodes and attributes.
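To make the "avoid quadratic affinity computations" idea concrete, here is a minimal sketch of how random features let one work with an affinity matrix without ever materializing it. This summary does not say which kernel or feature map TPO uses, so the Gaussian kernel and the random Fourier feature map below are stand-in assumptions, and the function name `random_fourier_features` is hypothetical.

```python
import numpy as np

def random_fourier_features(X, num_features, rng):
    """Map rows of X to features Z such that Z @ Z.T approximates the
    Gaussian kernel exp(-||x - y||^2 / 2).
    Assumption: this kernel is a stand-in, not the paper's MSA definition."""
    d = X.shape[1]
    W = rng.standard_normal((d, num_features))             # random projection directions
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)   # random phase shifts
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
n, d, D = 200, 5, 4000
X = 0.5 * rng.standard_normal((n, d))   # toy node attributes
Z = random_fourier_features(X, D, rng)  # n x D, linear in n

# Exact n x n kernel, built here only to check the approximation.
sq = np.sum(X ** 2, axis=1)
K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
max_err = np.max(np.abs(Z @ Z.T - K))

# Matrix-vector products with the approximate affinity need no n x n matrix:
# Z @ (Z.T @ v) costs O(n * D) time and memory instead of O(n^2).
v = rng.standard_normal(n)
Kv_fast = Z @ (Z.T @ v)
```

The point of the sketch is the last line: any solver that only touches the affinity matrix through products of this form can run on the factor `Z` alone, which is what makes clustering feasible at the scale of millions of nodes.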

Abstract

Attributed bipartite graphs (ABGs) are an expressive data model for describing the interactions between two sets of heterogeneous nodes that are associated with rich attributes, such as customer-product purchase networks and author-paper authorship graphs. Partitioning the target node set in such graphs into k disjoint clusters (referred to as k-ABGC) finds widespread use in various domains, including social network analysis, recommendation systems, information retrieval, and bioinformatics. However, the majority of existing solutions to k-ABGC either overlook attribute information or fail to capture bipartite graph structures accurately, severely compromising result quality. The severity of these issues is accentuated in real ABGs, which often encompass millions of nodes and a sheer volume of attribute data, rendering effective k-ABGC over such graphs highly challenging. In this paper, we propose TPO, an effective and efficient approach to k-ABGC that achieves superb clustering performance on multiple real datasets. TPO obtains high clustering quality through two major contributions: (i) a novel formulation and transformation of the k-ABGC problem based on multi-scale attribute affinity specialized for capturing attribute affinities between nodes with the consideration of their multi-hop connections in ABGs, and (ii) a highly efficient solver that includes a suite of carefully crafted optimizations for sidestepping explicit affinity matrix construction and facilitating faster convergence. Extensive experiments, comparing TPO against 19 baselines over 5 real ABGs, showcase the superior clustering quality of TPO measured against ground-truth labels. Moreover, compared to state-of-the-art methods, TPO is often more than 40x faster over both small and large ABGs.

Paper Structure

This paper contains 23 sections, 6 theorems, 31 equations, 7 figures, 3 tables, 3 algorithms.

Key Result

lemma 1

When $\gamma\rightarrow\infty$, $\mathbf{Z}_\mathcal{U}$ in Eq. eq:PX is the closed-form solution to the optimization problem in Eq. eq:Z-obj.

Figures (7)

  • Figure 1: An Illustrative Example of $k$-ABGC
  • Figure 2: Overview of TPO
  • Figure 3: Running time in seconds.
  • Figure 4: Clustering accuracy when varying parameters.
  • Figure 5: A Running Example.
  • ...and 2 more figures

Theorems & Definitions (7)

  • definition 1: $k$-Attributed Bipartite Graph Clustering ($k$-ABGC)
  • lemma 1
  • lemma 2
  • theorem 1
  • lemma 3
  • lemma 4
  • theorem 2: Eckart–Young Theorem [gloub1996matrix]
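
The Eckart–Young theorem listed above is the standard justification for SVD-based dimension reduction: the rank-k truncated SVD is the best rank-k approximation of a matrix in Frobenius norm, with error equal to the root sum of squares of the discarded singular values. The sketch below checks this numerically on a random matrix; the matrix sizes and the competing rank-k projection are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))   # toy stand-in for an attribute matrix
k = 5

# Thin SVD; singular values in s are sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation: keep the k largest singular triplets.
A_k = (U[:, :k] * s[:k]) @ Vt[:k]
err = np.linalg.norm(A - A_k, "fro")

# Eckart-Young: the optimal error equals the norm of the discarded tail.
tail = np.sqrt(np.sum(s[k:] ** 2))

# Any other approximation of rank at most k can do no better. Here the
# competitor projects A onto a random k-dimensional subspace.
Q, _ = np.linalg.qr(rng.standard_normal((50, k)))
B = Q @ (Q.T @ A)
err_B = np.linalg.norm(A - B, "fro")

# Reduced n x k representation whose Gram matrix equals that of A_k.
A_reduced = U[:, :k] * s[:k]
```

Because `A_reduced @ A_reduced.T` equals `A_k @ A_k.T`, inner-product-based affinities computed on the k-dimensional rows match those of the best rank-k approximation, which is why such a reduction can shrink the input while largely preserving similarity structure.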