Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach

Shijian Deng, Wentian Zhao, Yu-Jhe Li, Kun Wan, Daniel Miranda, Ajinkya Kale, Yapeng Tian

TL;DR

This paper tackles the resource burden and bias risks of model-level verifiers in multimodal LLM self-improvement by proposing a judge-free framework that combines controllable hallucination for data generation, a lightweight CLIP-based verifier for quality control, and Direct Preference Optimization for training. The approach eliminates the need for large external verifiers, achieving favorable precision and recall with substantially reduced computation, as demonstrated on public benchmarks and a new IC dataset designed to stress hallucination control. Key innovations include a controllable negative/positive sample generation mechanism, CLIP-score based data inversion, and a Gaussian sampling strategy for diverse training pairs. Collectively, the method enables scalable, efficient self-improvement for MLLMs with practical implications for deploying robust, hallucination-controlled vision-language systems.

Abstract

Self-improvement in multimodal large language models (MLLMs) is crucial for enhancing their reliability and robustness. However, current methods often rely heavily on MLLMs themselves as judges, leading to high computational costs and potential pitfalls like reward hacking and model collapse. This paper introduces a novel, model-level judge-free self-improvement framework. Our approach employs a controlled feedback mechanism while eliminating the need for MLLMs in the verification loop. We generate preference learning pairs using a controllable hallucination mechanism and optimize data quality by leveraging lightweight, contrastive language-image encoders to evaluate and reverse pairs when necessary. Evaluations across public benchmarks and our newly introduced IC dataset designed to challenge hallucination control demonstrate that our model outperforms conventional techniques. We achieve superior precision and recall with significantly lower computational demands. This method offers an efficient pathway to scalable self-improvement in MLLMs, balancing performance gains with reduced resource requirements.
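The "evaluate and reverse pairs when necessary" step can be illustrated with a small sketch. It assumes that per-sentence image-text similarity scores have already been produced by a lightweight contrastive language-image encoder. The names `PreferencePair` and `verify_pair` are hypothetical and chosen for illustration, and the plain zero threshold on the score difference is an assumption; none of these come from the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class PreferencePair:
    chosen: str
    rejected: str
    score_diff: float  # mean score of chosen minus mean score of rejected


def verify_pair(
    initial_pos: str,
    initial_neg: str,
    pos_sentence_scores: Sequence[float],
    neg_sentence_scores: Sequence[float],
) -> PreferencePair:
    """Keep or reverse a preference pair using averaged sentence-level scores.

    The scores are assumed (not computed here) to be image-text similarities
    from a contrastive encoder, one score per sentence of each response.
    """
    diff = mean(pos_sentence_scores) - mean(neg_sentence_scores)
    if diff < 0:
        # The verifier disagrees with the generation-time labels: swap them.
        return PreferencePair(initial_neg, initial_pos, -diff)
    return PreferencePair(initial_pos, initial_neg, diff)
```

Calling `verify_pair("a", "b", [0.10], [0.30])` returns a pair with `chosen == "b"`, because the initially negative response scores higher against the image. Keeping the score difference on the pair makes it possible to rank or sample pairs by that difference later.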

Paper Structure

This paper contains 19 sections, 3 equations, 15 figures, 3 tables, 1 algorithm.

Figures (15)

  • Figure 1: Comparison of three different improvement paradigms. (a) The conventional improvement paradigm requires humans to annotate feedback data and feed it into the model for improvement, making it the least efficient approach. (b) The self-improvement paradigm leverages the model itself to provide feedback; however, this approach is still inefficient due to the high cost and potential bias of using large models as verifiers. (c) Our efficient self-improvement paradigm improves the model without human feedback or model-level self-feedback by using a predefined data generation strategy combined with a lightweight verifier, achieving both efficiency and performance improvement. (d) Among all three paradigms, efficient self-improvement offers the best trade-off between performance and cost.
  • Figure 2: Overview of our framework. Our efficient self-improvement framework combines two main strategies: (a) We use a simple yet effective predefined preference dataset generation approach, employing two decoding paths during response generation. By adjusting the hallucination ratio $h_\text{ratio}$, we can control whether a negative or positive sample is generated for preference learning. (b) After the initial preferences are generated, we use a lightweight contrastive language-image pretrained encoder to calculate the average sentence-level $\text{CLIP}\_\text{score}$ difference between the initial positive and negative samples, swapping them when necessary to ensure the quality of the final preference dataset. (c) Finally, we apply DPO with the resulting dataset to improve the model.
  • Figure 3: Image reconstruction examples. To further demonstrate the effectiveness of our training framework, we use a text-to-image diffusion model, DALL·E 3, to convert captions generated by the models back into images. The reconstructed image from the original model's caption contains significant hallucination, while increasing the $h_\text{ratio}$ during generation produces a negative caption that, when reconstructed, shows even more hallucination in attributes like style and emotion. However, after training with these generated caption pairs, the reconstructed image from the improved model’s caption closely resembles the original, surpassing both the positive and negative samples.
  • Figure 4: Performance comparison of DPO training using various $\text{CLIP}\_\text{score}$ differences generated with $h_\text{ratio}=0.0$ and $h_\text{ratio}=0.1$, ranked from low to high. The best performance is highlighted.
  • Figure 5: Performance comparison of DPO training using various $\text{CLIP}\_\text{score}$ differences generated with $h_\text{ratio}$ sampled from a uniform distribution and ranked from low to high. The best performance is highlighted.
  • ...and 10 more figures
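Steps (a) and (c) of the Figure 2 caption can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' implementation: it assumes the two decoding paths are an image-conditioned path and an image-free path, and that $h_\text{ratio}$ linearly interpolates their next-token logits. The function names `blend_logits` and `dpo_loss` and the default `beta` are hypothetical. The loss itself is the standard per-pair DPO objective.

```python
import math
from typing import List, Sequence


def blend_logits(
    with_image: Sequence[float],
    without_image: Sequence[float],
    h_ratio: float,
) -> List[float]:
    """Interpolate next-token logits from two decoding paths.

    Assumption: h_ratio = 0 uses only the image-conditioned path, and larger
    values shift weight toward the image-free path, which is more likely to
    hallucinate.
    """
    if not 0.0 <= h_ratio <= 1.0:
        raise ValueError("h_ratio must lie in [0, 1]")
    if len(with_image) != len(without_image):
        raise ValueError("both paths must score the same vocabulary")
    return [
        (1.0 - h_ratio) * a + h_ratio * b
        for a, b in zip(with_image, without_image)
    ]


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With `h_ratio=0.0` the blended logits equal the image-conditioned logits, which corresponds to generating the initial positive sample. A nonzero value such as the `0.1` used in Figure 4 yields the initial negative sample. The loss falls below `log(2)` once the policy prefers the chosen response more strongly than the reference model does.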