PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

Jinhe Bi, Yifan Wang, Danqi Yan, Aniri, Wenke Huang, Zengjie Jin, Xiaowen Ma, Artur Hecker, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, Yunpu Ma

TL;DR

PRISM identifies representation anisotropy in pre-trained MLLM visual features as a key driver of inefficient data selection for visual instruction tuning. It introduces a training-free Intrinsic Selection paradigm that implicitly re-centers feature distributions, enabling the model’s own intrinsic semantics to reveal data redundancy via a centered correlation-based redundancy score. The method achieves substantial practical gains, reducing end-to-end training time by ~70% and surpassing full-dataset fine-tuning across multiple multimodal and language benchmarks. PRISM also demonstrates strong cross-model transferability and mitigates language knowledge forgetting, highlighting the value of geometry-aware data curation for scalable multimodal learning.

Abstract

Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applications. However, the rapid growth of these datasets introduces significant redundancy, leading to increased computational costs. Existing methods for selecting instruction data aim to prune this redundancy, but predominantly rely on computationally demanding techniques such as proxy-based inference or training-based metrics. Consequently, the substantial computational costs incurred by these selection processes often exacerbate the very efficiency bottlenecks they are intended to resolve, posing a significant challenge to the scalable and effective tuning of MLLMs. To address this challenge, we first identify a critical, yet previously overlooked, factor: the anisotropy inherent in visual feature distributions. We find that this anisotropy induces a Global Semantic Drift, and overlooking this phenomenon is a key factor limiting the efficiency of current data selection methods. Motivated by this insight, we devise PRISM, the first training-free framework for efficient visual instruction selection. PRISM surgically removes the corrupting influence of global background features by modeling the intrinsic visual semantics via implicit re-centering. Empirically, PRISM reduces the end-to-end time for data selection and model tuning to just 30% of conventional pipelines. More remarkably, it achieves this efficiency while simultaneously enhancing performance, surpassing models fine-tuned on the full dataset across eight multimodal and three language understanding benchmarks, culminating in a 101.7% relative improvement over the baseline. The code is available at https://github.com/bibisbar/PRISM.

Paper Structure

This paper contains 28 sections, 4 theorems, 10 equations, 6 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

In an anisotropic representation space where $\|\boldsymbol{\mu}\|_2 \gg \mathbb{E}[\|\boldsymbol{\delta}_i\|_2]$, the cosine similarity between any two randomly sampled vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ is dominated by the global drift $\boldsymbol{\mu}$, masking their semantic dissimilarity.
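The effect stated in Theorem 1 can be checked numerically with synthetic data. The sketch below assumes the decomposition $\mathbf{x}_i = \boldsymbol{\mu} + \boldsymbol{\delta}_i$ suggested by the theorem's notation, with a large shared offset and small independent Gaussian residuals; the dimensions, magnitudes, and the helper name `mean_pairwise_cosine` are illustrative choices, not values from the paper.

```python
import numpy as np


def mean_pairwise_cosine(x: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows in x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    off_diagonal = sims[~np.eye(x.shape[0], dtype=bool)]
    return float(off_diagonal.mean())


rng = np.random.default_rng(0)
n, d = 200, 64

# Assumed setup: a shared drift vector mu whose norm (80) is much larger
# than the typical residual norm (about 8 for unit-variance noise in 64-d).
mu = np.full(d, 10.0)
delta = rng.normal(0.0, 1.0, size=(n, d))
x = mu + delta

raw = mean_pairwise_cosine(x)                      # dominated by mu
centered = mean_pairwise_cosine(x - x.mean(axis=0))  # drift removed

print(f"mean cosine before centering: {raw:.3f}")
print(f"mean cosine after centering:  {centered:.3f}")
```

Before centering, unrelated random residuals still look nearly identical under cosine similarity because the shared offset dominates both the dot products and the norms. After subtracting the empirical mean, the average similarity falls close to zero, so pairwise similarity again reflects the residuals.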

Figures (6)

  • Figure 1: PRISM achieves state-of-the-art performance with substantially greater training efficiency. (Left) On a suite of multimodal and language benchmarks, PRISM (blue) uniformly outperforms strong baselines like LLaVA-665K and TIVE. (Right) The training loss curve demonstrates that PRISM converges faster and to a lower loss, reducing training time by 70% compared to the LLaVA baseline.
  • Figure 2: Visual Diagnosis of Representation Anisotropy in Pre-trained MLLM Features. (Left) The per-dimension mean distribution reveals a fundamental disparity: unlike text features (grey), which are well-centered around zero, all visual instruction datasets exhibit a significant non-zero mean. This provides direct evidence that visual embeddings occupy a biased, narrow cone, which is the geometric origin of Global Semantic Drift. (Right) The singular value scree plot confirms this diagnosis. The sharp "elbow point" indicates that the feature variance is confined to a few dominant dimensions, a classic symptom of representation degeneration.
  • Figure 3: End-to-end computational cost (GPU Hours) vs. final model performance. PRISM achieves state-of-the-art performance while significantly reducing total time, thus satisfying the condition OSC < 1.
  • Figure 4: A Comparison of Data Selection Paradigms. Existing methods, such as Proxy-Based and Training-Based Selection, introduce significant computational overhead by relying on external scorers or expensive, iterative training loops to estimate data importance. In contrast, our proposed Intrinsic Selection paradigm operates on a more fundamental principle: we first diagnose a geometric flaw—representation anisotropy—in the MLLM's native visual features. We then apply an implicit re-centering to correct this bias. This isotropic restoration unlocks the intrinsic semantic structure of the data, enabling highly effective and training-free pruning in a single pass.
  • Figure 5: Performance trade-off as a function of PRISM's sampling ratio. Visual performance (blue) improves with more data, while textual performance (grey) is best preserved with smaller subsets.
  • ...and 1 more figure

Theorems & Definitions (7)

  • Theorem 1
  • proof
  • Corollary 2
  • Theorem 3: Corruption of Geometric Proximity
  • proof
  • Theorem 4: The Computational Cost of Anisotropic Selection
  • proof