SkillRater: Untangling Capabilities in Multimodal Data

Naveen Sahi, Jeremy Dohmann, Armen Aghajanyan, Akshat Shrivastava

TL;DR

SkillRater addresses the limitation of single-scalar data quality by decomposing filtering into orthogonal, capability-specific raters trained with bilevel meta-learning. Each rater targets a distinct capability (visual understanding, OCR, STEM) and is optimized on a disjoint validation objective, with a curriculum that unions their signals to preserve diversity early and refine high-value samples later. The approach yields clear held-out gains over unfiltered training and monolithic DataRater across vision-language benchmarks, and raters trained at 1B parameters transfer to 2B without retraining. The near-orthogonality of the learned signals and the demonstrated scale transfer highlight the practical benefits of multidimensional quality in multimodal mid-training and suggest broader applicability to other domains requiring competing capabilities.

Abstract

Data curation methods typically assign each sample a single quality score. We argue that this scalar framing is fundamentally limited: when training requires multiple distinct capabilities, a monolithic scorer cannot maximize useful signal for all of them simultaneously. Quality is better understood as multidimensional, with each dimension corresponding to a capability the model must acquire. We introduce SkillRater, a framework that decomposes data filtering into specialized raters, one per capability, each trained via meta-learning on a disjoint validation objective. Their scores are composed through a progressive selection rule: at each training stage, a sample is retained if any rater ranks it above a threshold that tightens over time, preserving diversity early while concentrating on high-value samples late. We validate this approach on vision-language models, decomposing quality into three capability dimensions: visual understanding, OCR, and STEM reasoning. At 2B parameters, SkillRater improves over unfiltered baselines by 5.63% on visual understanding, 2.00% on OCR, and 3.53% on STEM on held-out benchmarks. The learned rater signals are near-orthogonal, confirming that the decomposition captures genuinely independent quality dimensions and explaining why it outperforms both unfiltered training and monolithic learned filtering.
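
The progressive selection rule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the percentile-rank normalization, the linear `threshold_schedule`, its `start`/`end` values, and the synthetic scores are all assumptions made for illustration. The only property taken from the abstract is that a sample is kept if any rater ranks it above a threshold that tightens across stages.

```python
import numpy as np

def percentile_ranks(scores):
    """Convert each rater's raw scores (one column per rater) to ranks in [0, 1]."""
    n = scores.shape[0]
    order = scores.argsort(axis=0).argsort(axis=0)  # 0 = lowest score in that column
    return order / max(n - 1, 1)

def keep_mask(scores, threshold):
    """Keep a sample if ANY rater ranks it at or above `threshold` (union rule)."""
    return (percentile_ranks(scores) >= threshold).any(axis=1)

def threshold_schedule(stage, n_stages, start=0.2, end=0.8):
    """Hypothetical linear schedule: the threshold tightens from `start` to `end`."""
    if n_stages == 1:
        return end
    return start + (end - start) * stage / (n_stages - 1)

# Synthetic scores for 1000 samples and 3 raters (e.g. visual, OCR, STEM).
rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 3))

# Number of samples retained at each of 4 training stages.
kept_per_stage = [
    int(keep_mask(scores, threshold_schedule(s, 4)).sum()) for s in range(4)
]
```

Because the rule takes a union over raters, a sample that only one rater values highly survives even late stages, which is how this kind of composition can keep capability-specific data that a single averaged score would discard.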

Paper Structure (34 sections, 11 equations, 4 figures, 6 tables, 1 algorithm)

Figures (4)

  • Figure 1: The SkillRater pipeline. Capability-aligned benchmarks and a training data pool are used in a bilevel meta-learning framework to learn per-capability raters (Visual Understanding, OCR, STEM). These raters score the training data, which is then filtered via a curriculum that progressively selects higher-quality samples throughout training.
  • Figure 2: Scale transfer at 1B and 2B parameters. Raters trained at 1B improve both scales without retraining, and the curriculum gap over unfiltered training widens over steps.
  • Figure 3: Orthogonality of capability-specific rater signals. Top-scoring subsets from each rater align with distinct directions in score space, indicating low overlap and complementary selection.
  • Figure 4: Qualitative retained and filtered samples for each capability-specific rater. Retained examples demand capability-aligned reasoning, while filtered examples are shallow or mismatched for that capability.
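
One way to probe the kind of low overlap that Figure 3 describes is sketched below. The two diagnostics used here (Pearson correlation between rater score columns, and Jaccard overlap between each rater's top-scoring subset) are assumptions chosen for illustration, not necessarily the measures used in the paper. The function names and the `frac` parameter are hypothetical.

```python
import numpy as np

def rater_correlation(scores):
    """Pearson correlation between rater score columns; near 0 suggests orthogonality."""
    return np.corrcoef(scores, rowvar=False)

def top_fraction_overlap(scores, frac=0.1):
    """Jaccard overlap between the top-`frac` subsets selected by each rater."""
    n, r = scores.shape
    k = max(1, int(round(frac * n)))
    # Indices of the k highest-scoring samples for each rater.
    tops = [set(np.argsort(scores[:, j])[-k:].tolist()) for j in range(r)]
    overlap = np.eye(r)
    for a in range(r):
        for b in range(a + 1, r):
            inter = len(tops[a] & tops[b])
            union = len(tops[a] | tops[b])
            overlap[a, b] = overlap[b, a] = inter / union
    return overlap

# Toy scores: raters 0 and 1 agree exactly, rater 2 prefers the opposite samples.
toy = np.column_stack([np.arange(10.0), np.arange(10.0), -np.arange(10.0)])
corr = rater_correlation(toy)
overlap = top_fraction_overlap(toy, frac=0.2)
```

In this toy setup the first two raters are redundant (correlation 1, full overlap), while the third selects a disjoint subset. Near-orthogonal raters would instead show correlations close to 0 and small off-diagonal overlaps.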