Gradient-based Discrete Sampling with Automatic Cyclical Scheduling

Patrick Pynadath, Riddhiman Bhattacharya, Arun Hariharan, Ruqi Zhang

TL;DR

The non-asymptotic convergence and inference guarantee for the automatic cyclical scheduling method in general discrete distributions is proved.

Abstract

Discrete distributions, particularly in high-dimensional deep models, are often highly multimodal due to inherent discontinuities. While gradient-based discrete sampling has proven effective, it is susceptible to becoming trapped in local modes because of its reliance on gradient information. To tackle this challenge, we propose automatic cyclical scheduling, designed for efficient and accurate sampling in multimodal discrete distributions. Our method contains three key components: (1) a cyclical step size schedule where large steps discover new modes and small steps exploit each mode; (2) a cyclical balancing schedule, ensuring "balanced" proposals for given step sizes and high efficiency of the Markov chain; and (3) an automatic tuning scheme for adjusting the hyperparameters in the cyclical schedules, allowing adaptability across diverse datasets with minimal tuning. We prove the non-asymptotic convergence and inference guarantee for our method in general discrete distributions. Extensive experiments demonstrate the superiority of our method in sampling complex multimodal discrete distributions.
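
Components (1) and (2) above can be pictured with a small sketch. The abstract does not give the schedule formulas, so everything below is an assumption made for illustration: the cosine shape, the function name `cyclical_schedules`, its parameters, and the choice to let the balancing parameter rise and fall together with the step size. It is not the authors' implementation, and the automatic tuning of component (3) is not shown.

```python
import math


def cyclical_schedules(num_steps, steps_per_cycle,
                       alpha_max, alpha_min, beta_max, beta_min):
    """Return a list of (alpha, beta) pairs, one per sampling iteration.

    Hypothetical sketch: a cosine curve restarts every `steps_per_cycle`
    iterations, so each cycle opens with a large step size (exploration)
    and decays toward a small one (exploitation). The balancing parameter
    beta is assumed to follow the same curve between beta_min and beta_max.
    """
    schedule = []
    for k in range(num_steps):
        # Position inside the current cycle, in [0, 1).
        pos = (k % steps_per_cycle) / steps_per_cycle
        # Weight is 1 at the start of a cycle and decays toward 0.
        w = 0.5 * (1.0 + math.cos(math.pi * pos))
        alpha = alpha_min + (alpha_max - alpha_min) * w
        beta = beta_min + (beta_max - beta_min) * w
        schedule.append((alpha, beta))
    return schedule


# Two cycles of five iterations each, with illustrative (made-up) bounds.
schedule = cyclical_schedules(num_steps=10, steps_per_cycle=5,
                              alpha_max=1.0, alpha_min=0.1,
                              beta_max=1.0, beta_min=0.5)
```

Each pair would then parameterize one proposal step of a gradient-based discrete sampler. Pairing every step size with its own balancing value mirrors the abstract's point that proposals should stay "balanced" for the step size in use.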

Paper Structure

This paper contains 75 sections, 6 theorems, 54 equations, 15 figures, 4 tables, 6 algorithms.

Key Result

Lemma 5.3

Let Assumptions assm:g:Lipschitz-assm:Hessian hold with $\alpha < \frac{1}{\beta M}$. Then for the Markov chain $P$ we have, for any $\theta, \theta' \in \Theta$, where $a \in \mathop{\mathrm{arg\,min}}\limits_{\theta \in \Theta} \|\nabla U(\theta)\|$.

Figures (15)

  • Figure 1: Sampling on a 2D distribution with multiple modes. (a): ground truth. (b): results from a random walk sampler. (c): results from DMALA (zhang2022langevinlike) with a manually tuned step size. (d): results from AB (sun2023anyscale). (e): results from our method ACS. While the random walk sampler (b) can find all modes, its characterization is noisy and lacks details for each mode. Gradient-based samplers (c) and (d) effectively characterize a specific mode but are easily trapped in some local modes. Our method (e) can find all modes efficiently and characterize each mode accurately.
  • Figure 2: (a) $\alpha$-schedule along with the corresponding $\beta$-schedule. The initial large steps enable the sampler to explore different regions of the distribution, while the smaller steps enable good characterization of each region. The balancing parameter $\beta$ varies with the step size to enable high acceptance rates for all step sizes. (b) Acceptance rate vs. step size for EBM sampling on MNIST, showing a non-monotonic relationship.
  • Figure 3: Sampling performance of various methods. Top row demonstrates convergence to ground truth on RBMs, bottom row demonstrates convergence speed on deep EBMs. We report the average performance across 11 random seeds within 1 standard error for the top row, and we show the average performance for the bottom row, as the error area is not visibly clear. For both distribution types, ACS demonstrates competitive performance with all baselines.
  • Figure 4: Average performance across multiple seeds for various hyper-parameter settings. We note that all configurations exhibit convergence to the ground truth as indicated by the maximum mean discrepancy (log MMD), albeit with varying convergence speeds. In some cases, specific hyper-parameter configurations achieve better performance than what we report in the RBM sampling experiment. Overall, our algorithm is reasonably robust to the choice of hyper-parameter configuration, as it still converges towards the ground truth.
  • Figure 5: Uneven multi-modal target distribution. While the top-left mode does have the most mass, only sampling from this mode will result in an inaccurate representation of the target distribution.
  • ...and 10 more figures

Theorems & Definitions (13)

  • Lemma 5.3
  • proof
  • Theorem 5.4
  • proof
  • Theorem 5.5
  • proof
  • proof
  • Proposition C.1
  • proof
  • Corollary C.2
  • ...and 3 more