Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation

Hongzhe Huang, Jiang Liu, Zhewen Yu, Li Cai, Dian Jiao, Wenqiao Zhang, Siliang Tang, Juncheng Li, Hao Jiang, Haoyuan Li, Yueting Zhuang

TL;DR

Align$^2$LLaVA introduces a cascaded alignment framework that compresses synthetic multimodal instruction data by first aligning with human preferences via reward models and then harmonizing instruction writing style with the MLLM's inner LLM through rewrite and review. The two-stage data filtration achieves up to 90% reduction in data while maintaining or improving performance across eight multimodal benchmarks, demonstrating strong data efficiency and transferability to different LLM backbones. Key contributions include the Align$^2$LLaVA pipeline, the Align$^2$LLaVA-Instruct dataset, and extensive ablations and analyses validating the necessity of both the human and the LLM alignment steps. The work suggests a viable path toward more efficient, high-quality multimodal instruction tuning in large-scale settings, with implications for cross-LLM applicability and practical deployment.

Abstract

Recent advances in Multi-modal Large Language Models (MLLMs), such as the LLaVA-series models, are driven by tuning on massive machine-generated instruction-following data. Such automatic instruction collection pipelines, however, inadvertently introduce significant variability in data quality. This paper introduces a novel instruction curation algorithm, derived from two perspectives, human and LLM preference alignment, to compress this vast corpus of machine-generated multimodal instructions into a compact, high-quality form: (i) For human preference alignment, we collect a machine-generated multimodal instruction dataset and establish a comprehensive set of subjective and objective criteria to guide human experts in assessing data quality. A reward model is then trained on the annotated dataset to internalize the nuanced human understanding of instruction alignment. (ii) For LLM preference alignment, given the instructions selected by the reward model, we propose leveraging the inner LLM used in the MLLM to align the writing style of visual instructions with that of the inner LLM itself, resulting in LLM-aligned instruction improvement. Extensive experiments demonstrate that we can maintain or even improve model performance while compressing synthetic multimodal instructions by up to 90%. Impressively, by aggressively reducing the training instructions from 158k to 14k (9$\times$ smaller), our model consistently outperforms its full-size dataset counterpart across various MLLM benchmarks. Our project is available at https://github.com/DCDmllm/Align2LLaVA.

Paper Structure

This paper contains 33 sections, 1 equation, 4 figures, 16 tables.

Figures (4)

  • Figure 1: (a) An example of synthetic visual instruction data generated by LLMs. (b) Demonstration of proposed cascaded human and LLM preference alignment. (c) Across 8 benchmarks, our approach achieves comparable or superior performance to LLaVA-1.5 trained on the full dataset (top), using a significantly reduced instruction set (bottom), demonstrating the efficiency of our method.
  • Figure 2: An overview of our data curation pipeline incorporating human knowledge and LLM characteristics. The process comprises three sequential steps: (1) Human preference data is curated through manual annotation of LLM-generated questions and answers. (2) Two reward models are trained to align with human values and are subsequently used for large-scale data filtration. (3) An inner LLM is employed to rewrite and review the selected instructions.
  • Figure 3: Human evaluation of LLaVA and Align$^2$LLaVA.
  • Figure 4: Performance transferring to different LLMs.
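
The cascade described in the Figure 2 caption (reward-model filtration followed by inner-LLM rewrite and review) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Instruction` type, the `curate` function, the `keep_ratio` parameter, the summing of the two reward scores, and the choice to drop samples whose rewrite fails review are all assumptions made for illustration. The reward models and the inner LLM are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instruction:
    """One synthetic visual instruction sample (hypothetical schema)."""
    image_id: str
    question: str
    answer: str


def curate(
    data: List[Instruction],
    question_reward: Callable[[Instruction], float],
    answer_reward: Callable[[Instruction], float],
    rewrite: Callable[[Instruction], Instruction],
    review: Callable[[Instruction, Instruction], bool],
    keep_ratio: float = 0.1,
) -> List[Instruction]:
    """Two-stage curation: reward-model filtering, then rewrite and review."""
    # Stage 1 (human preference alignment): rank samples by the two reward
    # models and keep the top fraction. Summing the scores is an assumption.
    ranked = sorted(
        data,
        key=lambda s: question_reward(s) + answer_reward(s),
        reverse=True,
    )
    n_keep = max(1, round(len(ranked) * keep_ratio)) if ranked else 0
    selected = ranked[:n_keep]

    # Stage 2 (LLM preference alignment): the inner LLM rewrites each
    # selected sample in its own style, then reviews the rewrite against
    # the original. Dropping rejected rewrites is an assumption.
    curated = []
    for original in selected:
        candidate = rewrite(original)
        if review(original, candidate):
            curated.append(candidate)
    return curated
```

With `keep_ratio=0.1`, stage 1 alone retains about 10% of the input, which mirrors the "up to 90%" compression figure quoted in the abstract; stage 2 can only shrink the set further in this sketch.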