Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback

Jiaming Ji, Jiayi Zhou, Hantao Lou, Boyuan Chen, Donghai Hong, Xuyao Wang, Wenqi Chen, Kaile Wang, Rui Pan, Jiahao Li, Mohan Wang, Josef Dai, Tianyi Qiu, Hua Xu, Dong Li, Weipeng Chen, Jun Song, Bo Zheng, Yaodong Yang

TL;DR

Align Anything introduces a unified, language-feedback–driven approach to align all-modality models with human preferences across text, image, audio, and video. It presents align-anything-200k, the first all-modality preference dataset, and Learning from Language Feedback (LLF), a two-stage method that uses critique and refinement to synthesize high-quality preference data beyond binary ratings. The paper also delivers Eval-Anything, a dedicated benchmark for evaluating all-modality understanding and generation, including modality selection and synergy. Together, these contributions enable scalable cross-modal instruction-following with open-source resources, advancing practical all-modality alignment and evaluation. The work also discusses ethical considerations, limitations, and future directions toward truly integrated multimodal systems.

Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in enhancing the instruction-following capabilities of large language models; however, it remains underexplored in the cross-modality domain. As the number of modalities increases, aligning all-modality models with human intentions, such as instruction following, becomes a pressing challenge. In this work, we make the first attempt to fine-tune all-modality models (i.e., models whose input and output can be of any modality, also called any-to-any models) using human preference data across all modalities (including text, image, audio, and video), ensuring that their behavior aligns with human intentions. This endeavor presents several challenges. First, existing open-source resources contain no large-scale all-modality human preference data, as most datasets are limited to specific modalities, predominantly text and image. Second, the effectiveness of binary preferences in RLHF for post-training alignment in complex all-modality scenarios remains unexplored. Finally, there is no systematic framework for evaluating the capabilities of all-modality models, particularly regarding modality selection and synergy. To address these challenges, we propose the align-anything framework, which includes 200k meticulously annotated all-modality human preference samples. We then introduce an alignment method that learns from unified language feedback, effectively capturing complex modality-specific human preferences and enhancing the model's instruction-following capabilities. Furthermore, to assess performance improvements in all-modality models after post-training alignment, we construct a challenging all-modality capability evaluation framework, eval-anything. All data, models, and code frameworks have been open-sourced for the community. For more details, please refer to https://github.com/PKU-Alignment/align-anything.

Paper Structure

This paper contains 72 sections, 4 equations, 15 figures, 11 tables.

Figures (15)

  • Figure 1: Composition and distribution of align-anything-200k. Our dataset comprises 8 subtasks across text, image, audio, and video modalities. Each modality exhibits distinct semantic features and distribution patterns, covering various latent spaces. This highlights that all-modality alignment cannot rely solely on data from specific modalities; rather, it requires the integration of data across modalities.
  • Figure 2: All-modality preference and language feedback annotation of align-anything-200k. For all-modality preference annotation, we classify the instruction-following metrics into two categories: modality-agnostic and modality-specific. Each fine-grained dimension is assigned a corresponding score along with a rationale. Additionally, we offer detailed language feedback, including critiques and refinement suggestions, which integrate information from multiple modalities within the responses.
  • Figure 3: Learning from language feedback pipeline. (1) Feedback modeling: we perform SFT on the initial model using annotated language feedback. (2) Self-improving: the initial model optimizes its responses given the language feedback to synthesize preference pairs.
  • Figure 4: Comparison of DPO+LLF with DPO on varying language feedback amounts. We trained the feedback models using 25%, 50%, and 75% of the language feedback (LF) compared to binary feedback (BF), synthesized an equal number of preference pairs from each, and then compared the resulting DPO performance against the initial model. We find that a small amount of language feedback can synthesize preference pairs that surpass those derived from binary feedback.
  • Figure 5: The eval-anything benchmark consists of two components. (Top) AMU: All-Modality Understanding, where the model answers open-ended questions by integrating textual instructions, images, videos, and audio. (Bottom) AMG: All-Modality Generation, which is divided into instruction-following, modality selection, and synergy subtasks. The model generates outputs for each modality (text, image, video, audio) based on instructions, with human-preferred combinations guiding modality selection metrics. A trained judge model evaluates the relevance, consistency, and synergy across different modalities in the outputs.
  • ...and 10 more figures
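
The self-improving stage described in the Figure 3 caption can be sketched roughly as below. This is only an illustrative sketch, not the authors' implementation: the names `PreferencePair`, `synthesize_pairs`, `generate`, `give_feedback`, and `refine` are hypothetical, the three callables stand in for the initial model and the SFT-trained feedback model, and treating the refined response as "chosen" and the initial response as "rejected" is an assumption about how the pairs are formed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    """One synthesized preference example (hypothetical container)."""
    prompt: str
    chosen: str    # response after refinement with language feedback
    rejected: str  # initial response before refinement


def synthesize_pairs(
    prompts: List[str],
    generate: Callable[[str], str],            # initial model: prompt -> response
    give_feedback: Callable[[str, str], str],  # feedback model: (prompt, response) -> critique
    refine: Callable[[str, str, str], str],    # initial model: (prompt, response, critique) -> refined
) -> List[PreferencePair]:
    """Build preference pairs by critiquing and refining initial responses."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        initial = generate(prompt)
        critique = give_feedback(prompt, initial)
        refined = refine(prompt, initial, critique)
        # Assumption: skip examples where refinement changed nothing,
        # since identical responses carry no preference signal.
        if refined != initial:
            pairs.append(PreferencePair(prompt, chosen=refined, rejected=initial))
    return pairs
```

Pairs produced this way could then be passed to a preference-optimization method such as DPO, which is the setup the Figure 4 caption compares against training on binary feedback.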