Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution

Wentao Tan, Qiong Cao, Yibing Zhan, Chao Xue, Changxing Ding

TL;DR

SENA is an annotation-free framework for aligning multimodal large language models through iterative self-evolution. It combines three mechanisms (image-driven self-questioning, answer self-enhancement, and image-content alignment) and trains with Direct Preference Optimization on discriminative preference data that the model generates itself from unlabeled images. Using original and diffusion-noised images together with descriptive prompts, SENA achieves competitive performance on both generative and discriminative benchmarks while reducing hallucinations. The approach scales with unlabeled data and generalizes across base models, offering a practical path to scalable alignment in multimodal systems.

Abstract

Human preference alignment can greatly enhance Multimodal Large Language Models (MLLMs), but collecting high-quality preference data is costly. A promising solution is the self-evolution strategy, where models are iteratively trained on data they generate. However, current techniques still rely on human- or GPT-annotated data and sometimes require additional models or ground-truth answers. To address these issues, we propose a novel multimodal self-evolution framework that enables the model to autonomously generate high-quality questions and answers using only unannotated images. First, we implement an image-driven self-questioning mechanism, allowing the model to create and evaluate questions based on image content, regenerating them if they are irrelevant or unanswerable. This sets a strong foundation for answer generation. Second, we introduce an answer self-enhancement technique, starting with image captioning to improve answer quality. We also use corrupted images to generate rejected answers, forming distinct preference pairs for optimization. Finally, we incorporate an image content alignment loss function alongside the Direct Preference Optimization (DPO) loss to reduce hallucinations, ensuring the model focuses on image content. Experiments show that our framework performs competitively with methods using external information, offering a more efficient and scalable approach to aligning MLLMs.
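
As a rough illustration of the optimization objective mentioned in the abstract, the sketch below computes the standard DPO loss for a single preference pair, where the chosen answer would come from the original image and the rejected answer from a corrupted one. The function name, argument names, and the default `beta` are assumptions for illustration. The image content alignment loss is not specified in this summary, so it is deliberately left out rather than guessed.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) answer pair.

    Inputs are summed token log-probabilities of each answer under the
    policy being trained (logp_*) and under the frozen reference model
    (ref_logp_*). beta=0.1 is an illustrative default, not the paper's value.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # answer over the rejected one, relative to the reference model.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference model agree exactly, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the chosen answer and away from the rejected one.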

Paper Structure

This paper contains 23 sections, 4 equations, 6 figures, 10 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Comparisons between (a) traditional framework and (b) our framework. Our framework combines carefully designed prompt mechanisms and an alignment function, completely eliminating the reliance on annotated data and additional models.
  • Figure 2: Illustration of the Image-Driven Self-Questioning. SQ checks whether $q_{gen}$ can be answered based on the content of the image. If it cannot, a new question relevant to the image content is generated. The majority of poor-quality questions can be transformed into reliable ones through just one check. Best viewed by zooming in.
  • Figure 3: Illustration of the Answer Self-Enhancement techniques. SE analyzes the previous question-and-answer pairs with the help of the image description and enhances the responses. The values in parentheses represent the CLIP scores of the answer-image pairs, which we use to indicate the quality of the answers. Best viewed by zooming in.
  • Figure 4: Comparison of outputs from various models on different visual tasks in MMHal-Bench. Best viewed in color.
  • Figure 5: The General Descriptive Prompt Set $P_{des}$.
  • ...and 1 more figure
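
The check-and-regenerate loop described in the Figure 2 caption can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate_question`, `is_answerable`, and `regenerate_question` are hypothetical callables standing in for prompts to the same MLLM, and the default of one check mirrors the caption's remark that a single check fixes most poor-quality questions.

```python
def self_question(image, generate_question, is_answerable, regenerate_question,
                  max_checks=1):
    """Return a question about `image`, regenerating it if it fails the check.

    All three callables are hypothetical placeholders for model prompts:
      generate_question(image) -> str
      is_answerable(image, question) -> bool
      regenerate_question(image, question) -> str
    """
    question = generate_question(image)
    for _ in range(max_checks):
        # Keep the question if it can be answered from the image content.
        if is_answerable(image, question):
            break
        # Otherwise ask for a new question tied to the image content.
        question = regenerate_question(image, question)
    return question


# Toy usage with stub callables in place of a real model.
toy_image = {"objects": ["dog"]}
toy_answerable = lambda image, q: any(obj in q for obj in image["objects"])
toy_regenerate = lambda image, q: f"What is the {image['objects'][0]} doing?"

fixed = self_question(
    toy_image, lambda image: "What color is the car?", toy_answerable, toy_regenerate
)
kept = self_question(
    toy_image, lambda image: "Where is the dog?", toy_answerable, toy_regenerate
)
```

In the toy usage, the first question mentions an object that is not in the image, so it is replaced once. The second question passes the check and is returned unchanged.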