A Bayesian Approach to Data Point Selection

Xinnuo Xu, Minyoung Kim, Royson Lee, Brais Martinez, Timothy Hospedales

TL;DR

This work views the DPS problem as posterior inference in a novel Bayesian model where the posterior distributions of the instance-wise weights and the main neural network parameters are inferred under a reasonable prior and likelihood model.

Abstract

Data point selection (DPS) is becoming a critical topic in deep learning due to the ease of acquiring uncurated training data compared to the difficulty of obtaining curated or processed data. Existing approaches to DPS are predominantly based on a bi-level optimisation (BLO) formulation, which is demanding in terms of memory and computation, and exhibits some theoretical defects regarding minibatches. Thus, we propose a novel Bayesian approach to DPS. We view the DPS problem as posterior inference in a novel Bayesian model where the posterior distributions of the instance-wise weights and the main neural network parameters are inferred under a reasonable prior and likelihood model. We employ stochastic gradient Langevin MCMC sampling to learn the main network and instance-wise weights jointly, ensuring convergence even with minibatches. Our update equation is comparable to the widely used SGD and much more efficient than existing BLO-based methods. Through controlled experiments in both the vision and language domains, we present the proof-of-concept. Additionally, we demonstrate that our method scales effectively to large language models and facilitates automated per-task optimization for instruction fine-tuning datasets.
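The abstract does not state the update rule itself, so the following is only a generic illustration of stochastic gradient Langevin dynamics (SGLD), not the paper's model. It is a minimal sketch on an assumed toy problem: inferring the mean of a 1-D Gaussian with unit noise under a Gaussian prior. The function names, the toy likelihood, and all hyperparameters are hypothetical. The point it shows is that each step is an SGD-like move on a minibatch gradient of the log posterior plus injected Gaussian noise.

```python
import math
import random


def sgld_gaussian_mean(data, n_steps=5000, batch_size=50, step_size=1e-4,
                       prior_std=10.0, seed=0):
    """Draw SGLD samples of mu for x_i ~ N(mu, 1) with prior mu ~ N(0, prior_std^2).

    Toy setting chosen for illustration only (an assumption, not the paper's model).
    """
    rng = random.Random(seed)
    n = len(data)
    mu = 0.0
    samples = []
    for _ in range(n_steps):
        # Minibatch drawn without replacement from the full dataset.
        batch = rng.sample(data, batch_size)
        # Gradient of the log prior with respect to mu.
        grad_prior = -mu / prior_std ** 2
        # Minibatch estimate of the full-data log-likelihood gradient,
        # rescaled by n / batch_size so it is unbiased.
        grad_lik = (n / batch_size) * sum(x - mu for x in batch)
        # Langevin noise with variance 2 * step_size.
        noise = rng.gauss(0.0, math.sqrt(2.0 * step_size))
        # SGD-like ascent step on the log posterior, plus noise.
        mu = mu + step_size * (grad_prior + grad_lik) + noise
        samples.append(mu)
    return samples


def exact_posterior_mean(data, prior_std=10.0):
    """Closed-form posterior mean for the conjugate toy model above."""
    return sum(data) / (len(data) + 1.0 / prior_std ** 2)
```

Dropping the `noise` term would reduce the loop to plain minibatch gradient ascent on the log posterior, which is the sense in which an SGLD update is comparable in cost to SGD. In this toy case the average of the samples after a burn-in period can be checked against `exact_posterior_mean`.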

Paper Structure

This paper contains 39 sections, 1 theorem, 15 equations, 14 figures, 4 tables.

Key Result

Theorem C.3

Let $d = \dim(\theta) + N_t$, $B$ be the batch size, and $\rho$ be the Cheeger constant. For any $\epsilon \in (0,1)$, with the initial iterate satisfying $p(\| \| \leq R/2) \leq \epsilon/16$ for $R = \overline{R}(\epsilon K^{-1}/12)$, and step size $\eta = \tilde{O}(\min\{\rho^2 d^{-2}, B^2 \rho^2 \ldots$ [...] for some constant $\lambda>0$, $C_0 = \tilde{O}(\rho^2)$, $C_1 = \tilde{O}(R d \rho^{-1})$, $C_2 = \ldots$

Figures (14)

  • Figure 1: Graphical model for BADS. Shaded nodes, representing curated ($D_m$) and uncurated ($D_t$) data, are evidence. Unshaded nodes, including model $\theta$ and instance weights $w$, are random variables.
  • Figure 2: Proof-of-concept experiment results. The top row displays the overall test performance across the three scenarios throughout the training phase, with the x- and y-axes denoting the training steps and the evaluation metrics, respectively. The bottom row visualizes the model-predicted weights of the data points in each mini-batch over the final 2000 steps of WebNLG training (scenario 3); the x- and y-axes show the training steps and the average weights, respectively. Data points in blue are expected to receive higher weights than their counterparts in red.
  • Figure 3: The MNIST test accuracy when trained with meta sets of varying sizes (x-axis).
  • Figure 4: The CIFAR test accuracy when trained with 80% noisy data.
  • Figure 5: The CIFAR test accuracy when trained with 20% noisy data.
  • ...and 9 more figures

Theorems & Definitions (1)

  • Theorem C.3: Adjusted from Theorem 4.5 in conv_sgld