OASIS: Online Sample Selection for Continual Visual Instruction Tuning

Minjae Lee, Minhyuk Seo, Tingyu Qu, Tinne Tuytelaars, Jonghyun Choi

TL;DR

Continual instruction tuning on streaming data faces training delays and forgetting as distributions shift. OASIS tackles this with two components: ORIS estimates sample informativeness from last‑layer gradients and normalizes across batches using EMA/EMV to obtain a cross‑batch relative score; SIREN then reduces redundancy by accounting for gradient similarity and higher‑order overlaps within the batch, without re-forwarding. Selection is probabilistic, guided by a thresholded sigmoid over the relative informativeness, enabling flexible data quotas. Empirical results across multiple large models and CIT benchmarks show OASIS achieves near full‑data performance using as little as 25% of data, with better efficiency and diversity than prior baselines.
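The cross-batch normalization and probabilistic selection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class `RunningStats`, the function `select_batch`, the smoothing factor `alpha`, the `shift` knob, and the exact EMA/EMV update rule are all assumptions made for the example, and the raw scores are made-up numbers rather than gradient-based informativeness.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class RunningStats:
    """Exponential moving average (EMA) and variance (EMV) of raw scores.

    The update rule below is one common choice and is an assumption here.
    """

    def __init__(self, alpha=0.1, eps=1e-8):
        self.alpha = alpha  # hypothetical smoothing factor
        self.eps = eps      # guards against division by zero
        self.mean = None
        self.var = None

    def update(self, scores):
        batch_mean = sum(scores) / len(scores)
        batch_var = sum((s - batch_mean) ** 2 for s in scores) / len(scores)
        if self.mean is None:
            self.mean, self.var = batch_mean, batch_var
        else:
            a = self.alpha
            self.mean = (1 - a) * self.mean + a * batch_mean
            self.var = (1 - a) * self.var + a * batch_var

    def normalize(self, scores):
        std = math.sqrt(self.var) + self.eps
        return [(s - self.mean) / std for s in scores]


def select_batch(scores, stats, rng, shift=0.0):
    """Return (selected indices, selection probabilities) for one batch.

    `shift` is a hypothetical offset that moves the sigmoid and thereby
    changes the expected fraction of selected samples.
    """
    first_batch = stats.mean is None
    if first_batch:
        stats.update(scores)  # bootstrap the statistics from the first batch
    relative = stats.normalize(scores)
    probs = [sigmoid(z - shift) for z in relative]
    # Keep a sample when its probability exceeds a uniformly drawn threshold.
    chosen = [i for i, p in enumerate(probs) if p > rng.random()]
    if not first_batch:
        stats.update(scores)  # fold the current batch into the running stats
    return chosen, probs


rng = random.Random(0)
stats = RunningStats()
_, probs_a = select_batch([1.0, 2.0, 3.0], stats, rng)
_, probs_b = select_batch([10.0, 11.0, 12.0], stats, rng)
```

In this toy run the second batch scores far above the running statistics, so every sample in it receives a selection probability close to 1. The number of selected samples therefore varies from batch to batch instead of being fixed.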

Abstract

In continual instruction tuning (CIT) scenarios, where new instruction tuning data continuously arrive in an online streaming manner, training delays from large-scale data significantly hinder real-time adaptation. Data selection can mitigate this overhead, but existing strategies often rely on pretrained reference models, which are impractical in CIT setups since future data are unknown. Recent reference model-free online sample selection methods address this, but typically select a fixed number of samples per batch (e.g., top-k), making them vulnerable to distribution shifts where informativeness varies across batches. To address these limitations, we propose OASIS, an adaptive online sample selection approach for CIT that (1) selects informative samples by estimating each sample's informativeness relative to all previously seen data, beyond batch-level constraints, and (2) minimizes informative redundancy of selected samples through iterative selection score updates. Experiments on various large foundation models show that OASIS, using only 25 percent of the data, achieves comparable performance to full-data training and outperforms the state-of-the-art sampling methods.
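To see why a fixed per-batch quota is sensitive to distribution shift, consider the toy comparison below. The scores are invented for illustration, `top_k_indices` is a hypothetical helper, and the pooled-mean rule is only a stand-in for any criterion that compares samples across batches; it is not the method proposed in the paper.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores within a single batch."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Made-up informativeness scores for two batches of a stream.
low_batch = [0.10, 0.12, 0.11, 0.09]   # uniformly uninformative batch
high_batch = [0.90, 0.95, 0.85, 0.92]  # uniformly informative batch

# Fixed quota: exactly one sample per batch, whatever the batch contains.
picked_low = top_k_indices(low_batch, k=1)
picked_high = top_k_indices(high_batch, k=1)

# Cross-batch criterion (illustrative only): compare against the pooled mean.
pooled = low_batch + high_batch
pooled_mean = sum(pooled) / len(pooled)
above_low = [i for i, s in enumerate(low_batch) if s > pooled_mean]
above_high = [i for i, s in enumerate(high_batch) if s > pooled_mean]
```

Top-k keeps one sample from each batch, even though every sample in `high_batch` scores higher than every sample in `low_batch`. The cross-batch rule keeps nothing from the first batch and everything from the second. In a real stream, future batches are not available for pooling, which is why a running estimate over previously seen data is needed instead.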

Paper Structure

This paper contains 50 sections, 4 theorems, 27 equations, 7 figures, 19 tables.

Key Result

Theorem 3.1

Let $I^{(t)}_i$ be defined by Eq. \ref{eq:select} for samples in $\mathcal{B}_t$, and let $(\mu_{t-1},\sigma^2_{t-1})$ be the EMA and EMV from Eq. \ref{eq:EMA_EMV} computed from batches prior to $t$. Assume local stationarity and local weak dependence of $\{I^{(t)}_i\}$ with uniformly bounded second moments. Th

Figures (7)

  • Figure 1: Real-time adaptation under equal training time. Bar width indicates training data volume in the CIT data stream. While 'Full' trains on all data, TIVE \cite{liu2024less}, Adapt-$\infty$ \cite{maharana2025adaptinfty}, and OASIS use 25% selected data. Under equal training time, 'Full' degrades on newly arrived tasks (e.g., $T_3$, $T_4$), since sequential training on all data provides sufficient time for earlier tasks (e.g., $T_1$, $T_2$) but insufficient time for new ones. TIVE and Adapt-$\infty$ achieve only marginal speedup despite using 25% of the data, as the backward-pass selection overhead limits real-time adaptation. OASIS uses inference-only selection with minimal overhead, enabling efficient sample selection and strong adaptation to new tasks.
  • Figure 2: Overview of our proposed OASIS. For each online batch $\mathcal{B}_t$: (1) OASIS first scores the informativeness $I$ of every sample in the batch $\mathcal{B}_t=\{d^{(t)}_1, d^{(t)}_2, \dots\}$ (Eq. \ref{eq:select}); (2) it then iteratively reduces redundancy by adjusting the $I$ of the other samples to $\widetilde{I}$ based on their gradient similarity $S_{i,j}$ to the most informative sample (here, $d^{(t)}_2$) (Sec. \ref{subsec:ours_two}); (3) it computes the relative informativeness $\hat{I}$ by normalizing the updated informativeness $\widetilde{I}$ using the EMA $\mu_t$ and EMV $\sigma_t$ (Eq. \ref{eq:normalize}); (4) finally, it computes the selection probability $P_S$ and selects the samples whose probability exceeds a uniformly drawn threshold $r$, yielding a selected subset $\mathcal{B}^*_t \subset \mathcal{B}_t$ (Eq. \ref{eq:prob_selection}). The model $f_{\theta}$ is then trained using only $\mathcal{B}^*_t$.
  • Figure 3: Accuracy and FLOPs with 25% selection ratio on MICVIT. The top-left corner illustrates effective and efficient sample selection.
  • Figure 4: Comparison of fast adaptation performance. After CIT of LLaVA-1.5-7B on subsets (25% of the full data), selected using each sample selection baseline from MICVIT, we fine-tune the model for 100 epochs on each downstream task (i.e., VISION, COMICS, MagicBrush, and DreamSim).
  • Figure 5: Comparison of fast adaptation performance. After CIT of LLaVA-1.5-7B on subsets (25% of the full data), selected using each sample selection baseline from COAST, we fine-tune the model for 100 epochs on each downstream task (i.e., VISION, COMICS, MagicBrush, and DreamSim).
  • ...and 2 more figures
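
Step (2) of the Figure 2 caption, the iterative redundancy reduction, could look roughly like the sketch below. The caption does not give the exact update rule, so the multiplicative discount `1 - max(0, cosine similarity)` is an assumption made for illustration. The function names `cosine` and `discount_redundancy` are hypothetical, scores are assumed non-negative, the gradient vectors are toy values, and the higher-order overlaps mentioned in the TL;DR are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def discount_redundancy(scores, grads):
    """Greedily lower the scores of samples similar to more informative ones.

    Assumed rule (not taken from the paper): each time the current most
    informative sample is fixed, every remaining score is multiplied by
    1 - max(0, cosine similarity to that sample's gradient).
    """
    adjusted = list(scores)
    remaining = set(range(len(scores)))
    while remaining:
        top = max(remaining, key=lambda i: adjusted[i])
        remaining.remove(top)
        for j in remaining:
            sim = max(0.0, cosine(grads[top], grads[j]))
            adjusted[j] *= 1.0 - sim
    return adjusted


# Toy batch: samples 0 and 1 share a gradient direction, sample 2 is orthogonal.
toy_scores = [3.0, 2.0, 1.0]
toy_grads = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
adjusted_scores = discount_redundancy(toy_scores, toy_grads)
```

In this toy batch, sample 1 duplicates the gradient direction of the more informative sample 0, so its score drops to zero, while the orthogonal sample 2 keeps its score. The adjusted scores would then feed into the normalization and selection steps (3) and (4).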

Theorems & Definitions (6)

  • Theorem 3.1
  • Lemma A.1: $\hat{I}^{(t)}$ has finite $(2+\delta)$-moments
  • Lemma A.2: Normal approximation rate for FI trace via chi-square
  • Lemma A.3: EMA/EMV are consistent approximations of the true mean and variance
  • Proof of Lemma \ref{lem:ratio_consistency_easy}
  • Proof of Theorem \ref{thm:main_normal}