Stochastic Variance-Reduced Iterative Hard Thresholding in Graph Sparsity Optimization

Derek Fox, Samuel Hernandez, Qianqian Tong

TL;DR

The paper tackles graph-structured sparsity optimization under large-scale data where stochastic gradients suffer from variance. It introduces two stochastic variance-reduced gradient methods, GraphSVRG-IHT and GraphSCSG-IHT, that incorporate head and tail projections to enforce graph-structured sparsity and leverage variance-reduction techniques for non-convex objectives. The authors provide a general theoretical framework proving linear convergence with a constant learning rate, and validate the approach experimentally on synthetic data and a real breast cancer gene dataset, showing improved convergence and better gene selection performance. The work advances efficient, scalable sparsity-constrained learning in graph-structured domains with potential impact on disease monitoring and network analysis, and lays groundwork for applying these methods to larger real-world datasets.
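The combination of a variance-reduced gradient with a sparsity projection can be illustrated with a minimal sketch. It is not the authors' GraphSVRG-IHT: the graph head and tail projections are replaced here by plain top-`s` hard thresholding, the loss is assumed to be least squares, and the names `hard_threshold` and `svrg_iht` as well as all parameter defaults are hypothetical choices made for illustration.

```python
import numpy as np


def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero out the rest.

    Stand-in for the paper's graph head/tail projections (an assumption).
    """
    out = np.zeros_like(x)
    if s > 0:
        keep = np.argpartition(np.abs(x), -s)[-s:]
        out[keep] = x[keep]
    return out


def svrg_iht(A, y, s, lr, n_epochs=30, inner_steps=None, seed=0):
    """SVRG-style iterative hard thresholding for F(x) = ||Ax - y||^2 / (2n)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    inner_steps = n if inner_steps is None else inner_steps
    x_ref = np.zeros(d)  # reference (snapshot) point
    for _ in range(n_epochs):
        # Full gradient at the snapshot, computed once per outer loop.
        full_grad = A.T @ (A @ x_ref - y) / n
        x = x_ref.copy()
        for _ in range(inner_steps):
            i = rng.integers(n)
            a = A[i]
            grad_x = a * (a @ x - y[i])        # stochastic gradient at x
            grad_ref = a * (a @ x_ref - y[i])  # same sample at the snapshot
            # Variance-reduced estimate: unbiased, with shrinking variance
            # as x and x_ref approach the same point.
            v = grad_x - grad_ref + full_grad
            # Gradient step with a constant learning rate, then projection.
            x = hard_threshold(x - lr * v, s)
        x_ref = x
    return x_ref
```

The full gradient is recomputed only once per outer loop, so each inner step costs two single-sample gradients while the correction term keeps the estimate unbiased. A graph-structured variant would, under the same assumptions, swap `hard_threshold` for projections that only admit supports allowed by the graph model.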

Abstract

Stochastic optimization algorithms are widely used for large-scale data analysis due to their low per-iteration costs, but they often suffer from slow asymptotic convergence caused by inherent variance. Variance-reduced techniques have therefore been used to address this issue in structured sparse models utilizing sparsity-inducing norms or $\ell_0$-norms. However, these techniques are not directly applicable to complex (non-convex) graph sparsity models, which are essential in applications like disease outbreak monitoring and social network analysis. In this paper, we introduce two stochastic variance-reduced gradient-based methods to solve graph sparsity optimization: GraphSVRG-IHT and GraphSCSG-IHT. We provide a general framework for theoretical analysis, demonstrating that our methods enjoy a linear convergence speed. Extensive experiments validate

Paper Structure (17 sections, 3 theorems, 42 equations, 5 figures, 4 tables, 2 algorithms)

This paper contains 17 sections, 3 theorems, 42 equations, 5 figures, 4 tables, 2 algorithms.

Key Result

Lemma 1

[zhou2019stochastic] If each $f_{\xi_t}(\cdot)$ and $F(x)$ satisfy Assumption ass:three, and given the head projection model $(c_H, M \oplus M_T, M_H)$ and the tail projection model $(c_T, M, M_T)$, then the following inequality holds: [...] where [...] and $\tau \in (0, 2/\beta)$.
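For orientation, the head and tail projection models named in the lemma can be read through the standard approximate-projection notions from the structured-sparsity literature. The block below is a sketch under that assumption, not a quotation of the paper's definitions; the choice of the $\ell_2$ norm, the symbols $\mathcal{C}$, $S$, $S'$, and the constant ranges $c_T \ge 1$, $c_H \in (0, 1]$ follow that convention and do not come from the text above. Here $M$, $M_T$, $M_H$ are collections of allowed supports, $x_S$ is $x$ restricted to support $S$, and $\mathcal{C}$ is the comparison class ($M$ for the tail model, $M \oplus M_T$ for the head model in the lemma).

$$
\begin{aligned}
\text{Tail } (c_T, M, M_T):\quad & \text{return } S \in M_T \text{ with } \|x - x_S\|_2 \le c_T \min_{S' \in M} \|x - x_{S'}\|_2, \\
\text{Head } (c_H, \mathcal{C}, M_H):\quad & \text{return } S \in M_H \text{ with } \|x_S\|_2 \ge c_H \max_{S' \in \mathcal{C}} \|x_{S'}\|_2 .
\end{aligned}
$$

Read this way, the tail projection bounds how much of $x$ is lost outside the returned support, while the head projection guarantees that the returned support captures a constant fraction of the best achievable energy.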

Figures (5)

  • Figure 1: Comparison of methods with different learning rates.
  • Figure 2: Comparison of methods with different sparsities.
  • Figure 3: Number of data points vs. residual loss value with different batch sizes.
  • Figure 4: Various choices for $B$ and $b$.
  • Figure 5: Different numbers of connected components.

Theorems & Definitions (11)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Lemma 1
  • Theorem 2
  • Corollary 2.1
  • proof
  • proof
  • proof
  • ...and 1 more