On Representation Redundancy in Large-Scale Instruction Tuning Data Selection

Youwei Shu, Shaomian Zheng, Dingnan Jin, Wenjie Qu, Ziyao Guo, Qing Cui, Jun Zhou, Jiaheng Zhang

TL;DR

This work proposes Compressed Representation Data Selection (CRDS), a novel framework with two variants that substantially enhance data quality and consistently outperform state-of-the-art representation-based selection methods.

Abstract

Data quality is a crucial factor in large language model training. While prior work has shown that models trained on smaller, high-quality datasets can outperform those trained on much larger but noisy or low-quality corpora, systematic methods for industrial-scale data selection in instruction tuning remain underexplored. In this work, we study instruction-tuning data selection through the lens of semantic representation similarity and identify a key limitation of state-of-the-art LLM encoders: they produce highly redundant semantic embeddings. To mitigate this redundancy, we propose Compressed Representation Data Selection (CRDS), a novel framework with two variants. CRDS-R applies Rademacher random projection followed by concatenation of transformer hidden-layer representations, while CRDS-W employs whitening-based dimensionality reduction to improve representational quality. Experimental results demonstrate that both variants substantially enhance data quality and consistently outperform state-of-the-art representation-based selection methods. Notably, CRDS-W achieves strong performance using only 3.5% of the data, surpassing the full-data baseline by an average of 0.71% across four datasets. Our code is available at https://github.com/tdano1/CRDS.

Paper Structure (36 sections, 12 equations, 5 figures, 13 tables, 3 algorithms)

Figures (5)

  • Figure 1: Illustration of our proposed methods and the SOTA baseline RDS+. RDS+ constructs data representations by directly using the last hidden state of the encoder. CRDS-R constructs representations by extracting the last several hidden states of the encoder, applying a Rademacher projection to each layer, and concatenating the projected features to obtain the final representation, which retains the same dimensionality as the original representation. CRDS-W constructs representations by first fitting a whitening transformer on a large subset of the data using the last hidden state of the encoder, and then applying this transformer to whiten all data representations for similarity computation.
  • Figure 2: Overview of the proposed pipeline. Black, purple, and orange arrows denote the public flow, CRDS-R flow, and CRDS-W flow, respectively. Compared with the simpler CRDS-R, CRDS-W introduces additional gathering, fitting, and saving operations to support the whitening setting. The algorithmic details are provided in Appendix \ref{app:alg}.
  • Figure 3: Results of the ablation studies are presented in Section \ref{ablation}. See settings in Table \ref{exp_settings}.
  • Figure 4: Effect of $H$ on systematic extraction in CRDS-R. See settings in Table \ref{exp_settings}.
  • Figure : Distributed Data Similarity Computation
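
The two constructions described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The function names, the even split of the output dimension across the $H$ layers, the $1/\sqrt{k}$ scaling of the Rademacher matrix, and the eigendecomposition-based whitening with a target dimension `k` are all assumptions made for illustration.

```python
import numpy as np

def rademacher_matrix(d_in, d_out, rng):
    # Random +1/-1 entries, scaled so squared norms are preserved in expectation.
    signs = rng.integers(0, 2, size=(d_in, d_out)) * 2 - 1
    return signs.astype(np.float64) / np.sqrt(d_out)

def crds_r(hidden_states, seed=0):
    # hidden_states: list of H arrays of shape (n, d), one per extracted layer.
    # Assumption: each layer is projected to d // H dims, so the concatenation
    # has the original dimensionality d when H divides d.
    rng = np.random.default_rng(seed)
    num_layers = len(hidden_states)
    d = hidden_states[0].shape[1]
    k = d // num_layers
    parts = [h @ rademacher_matrix(d, k, rng) for h in hidden_states]
    return np.concatenate(parts, axis=1)

def fit_whitening(X, k, eps=1e-8):
    # Fit on a (large) subset X of last-hidden-state representations, shape (n, d).
    # Assumption: PCA-style whitening that keeps the top-k principal directions.
    mu = X.mean(axis=0, keepdims=True)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    W = eigvecs[:, order] / np.sqrt(eigvals[order] + eps)
    return mu, W

def apply_whitening(X, mu, W):
    # Center with the fitted mean, then rotate and rescale.
    return (X - mu) @ W

def cosine_similarity(A, B):
    # Row-wise cosine similarity between two sets of representations.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T
```

In this sketch the whitening parameters `mu` and `W` are fitted once and then reused for every representation, which mirrors the fit-then-apply flow the caption describes for CRDS-W. Either kind of representation can then be passed to `cosine_similarity` for the similarity computation.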