Less Data, Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning

Rujie Wu; Haozhe Zhao; Hai Ci; Yizhou Wang

Abstract

Multimodal instruction tuning is often compute-inefficient because training budgets are spread across large mixed image-video pools whose utility is highly uneven. We present Goal-Driven Data Optimization (GDO), a framework that computes six sample descriptors for each candidate and constructs optimized 1$\times$ training subsets for different goals. Under a fixed one-epoch Qwen3-VL-8B-Instruct training and evaluation recipe on 8 H20 GPUs, GDO uses far fewer training samples than the Uni-10x baseline while converging faster and achieving higher accuracy. Relative to the fixed 512k-sample Uni-10x baseline, GDO reaches the Uni-10x reference after 35.4k samples on MVBench, 26.6k on VideoMME, 27.3k on MLVU, and 34.7k on LVBench, while improving accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points, respectively. The gains are largest on MVBench and MLVU, while LVBench improves more modestly, consistent with its ultra-long-video setting and the mismatch between that benchmark and the short-video/image-dominant training pool. Across MinLoss, Diverse, Temp, and Temp+, stronger temporal emphasis yields steadily better long-video understanding. Overall, GDO provides a goal-driven data optimization framework that enables faster convergence with fewer training samples under a fixed training protocol. Code is available at https://github.com/rujiewu/GDO.
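
The sample counts quoted in the abstract can be turned into data-reduction factors with a few lines of arithmetic. The sketch below assumes that "reduction" means the 512k-sample baseline budget divided by the number of samples at which GDO first matches the Uni-10x reference; that definition, and the names `BASELINE_K`, `PEAK_MATCH_K`, and `reduction_factor`, are illustrative assumptions rather than anything taken from the released code.

```python
# Sample counts quoted in the abstract, in thousands of training samples.
BASELINE_K = 512.0  # fixed Uni-10x baseline budget
PEAK_MATCH_K = {    # samples at which GDO reaches the Uni-10x reference
    "MVBench": 35.4,
    "VideoMME": 26.6,
    "MLVU": 27.3,
    "LVBench": 34.7,
}


def reduction_factor(baseline_k: float, peak_match_k: float) -> float:
    """Assumed definition: baseline budget divided by peak-match samples."""
    return baseline_k / peak_match_k


reductions = {
    name: reduction_factor(BASELINE_K, samples_k)
    for name, samples_k in PEAK_MATCH_K.items()
}

for name, factor in reductions.items():
    print(f"{name}: matches the reference with about {factor:.1f}x fewer samples")
```

Under this assumed definition, every benchmark lands above a 10$\times$ reduction, which agrees with the statement in the Figure 1 caption.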

Paper Structure (24 sections, 14 equations, 11 figures, 6 tables)

Introduction
Method
Problem Setup
Six Sample Descriptors
Scoring and Feasibility
Goal Profiles
Experiments
Experimental Setup
Benchmark Results
Subtask Analysis
Ablation Study
Discussion
Related Work
Conclusion
Additional Analysis
...and 9 more sections

Figures (11)

Figure 1: Peak Match with Less Data. Accuracy is plotted against training samples for MVBench, VideoMME, and MLVU, comparing the GDO trajectory with the fixed Uni-10x baseline. The dashed line marks the Uni-10x reference, and the star marks the first displayed GDO point that reaches it. Table \ref{tab:intro-snapshot} reports the corresponding Peak Match and Reduction values; all four benchmarks exceed 10$\times$ reduction relative to the 512k-sample Uni-10x reference.
Figure 2: Subset Construction. GDO computes six sample descriptors over one shared pool, applies a shared score and goal-specific feasibility presets to build 1$\times$ optimized subsets, and compares them against Uni-10x under the same fixed backbone, SFT recipe, checkpoints, and benchmarks. The figure visualizes the paper's comparison contract: only data allocation changes.
Figure 3: Frontier Shifts by Goal. For each benchmark, the dots mark the strongest operating points attained by the four released GDO profiles, while the square marks the fixed 512k-sample Uni-10x baseline. Different allocation goals populate different parts of the frontier under the same training and evaluation contract.
Figure 4: MVBench Subtask Delta Heatmap. Each cell shows the final displayed difference between a GDO profile and the fixed Uni-10x branch on the same MVBench subtask. The heatmap complements the subtask summary in Table \ref{tab:subtask-deltas} by making both gain concentration and boundary trade-offs visible across the full subtask map.
Figure 5: Trajectories. Each panel compares GDO with the fixed 512k-sample Uni-10x baseline on one benchmark. Stars denote the earliest peak-match point.
...and 6 more figures
