SHAPE: Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner

Kejia Chen, Jiawen Zhang, Jiacong Hu, Jiazhen Yang, Jian Lou, Zunlei Feng, Mingli Song

TL;DR

The paper tackles the challenge of aligning LVLM outputs with human preferences while avoiding costly human-annotated data. It introduces SHAPE, a self-supervised framework that constructs holistic winner-loser preference triplets from existing image-text pairs by applying multiple image augmentations and a summarization step, followed by iterative Direct Preference Optimization fine-tuning. SHAPE demonstrates substantial gains across 12 benchmarks and several model sizes (7B and 13B), outperforming prior augmentation-based methods and reducing reliance on labeled data. Qualitative analyses indicate improved attention to visual details and more human-aligned holistic descriptions, suggesting SHAPE enhances robustness and reduces hallucinations in multimodal generation. This approach promises practical benefits for deploying LVLMs in real-world settings by lowering annotation costs while improving alignment quality.
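For readers unfamiliar with Direct Preference Optimization, the per-example objective it minimizes can be sketched as below. This is the standard DPO loss, not code from the SHAPE paper; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed log-probabilities of each response under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (image, winner, loser) triplet.

    All four inputs are sequence log-probabilities (floats). The names and
    the beta default are assumptions for illustration only.
    """
    # Implicit reward margin: how much more the policy prefers the winner
    # over the loser, measured relative to the reference model.
    margin = beta * (
        (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    )
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is lower when the policy favors the winner more than the reference does.
favors_winner = dpo_loss(-1.0, -3.0, -2.0, -2.0)
favors_loser = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

In an iterative scheme such as the one described above, the model obtained after one round of fine-tuning would presumably be used to generate the next round of preference data; how SHAPE chooses the reference model across rounds is not specified in this summary.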

Abstract

Large Visual Language Models (LVLMs) increasingly rely on preference alignment to ensure reliability, which steers the model behavior via preference fine-tuning on preference data structured as "image - winner text - loser text" triplets. However, existing approaches often suffer from limited diversity and high costs associated with human-annotated preference data, hindering LVLMs from fully achieving their intended alignment capabilities. We present SHAPE, a self-supervised framework capable of transforming the already abundant supervised text-image pairs into holistic preference triplets for more effective and cheaper LVLM alignment, eliminating the need for human preference annotations. Our approach facilitates LVLMs in progressively enhancing alignment capabilities through iterative self-improvement. The key design rationale is to devise preference triplets where the winner text consistently improves in holisticness and outperforms the loser response in quality, thereby pushing the model to "strive to the utmost" of alignment performance through preference fine-tuning. For each given text-image pair, SHAPE introduces multiple visual augmentations and pairs them with a summarized text to serve as the winner response, while designating the original text as the loser response. Experiments across 12 benchmarks on various model architectures and sizes, including LLaVA and DeepSeek-VL, show that SHAPE achieves significant gains, for example, achieving +11.3% on MMVet (comprehensive evaluation), +1.4% on MMBench (general VQA), and +8.0% on POPE (hallucination robustness) over baselines in 7B models. Notably, qualitative analyses confirm enhanced attention to visual details and better alignment with human preferences for holistic descriptions.
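The triplet construction described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `PreferenceTriplet`, `build_triplet`, and the callables `augmentations`, `generate`, and `summarize` are hypothetical names, and the assumption that the model produces one candidate text per augmented view before summarization is an interpretation of the abstract.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class PreferenceTriplet:
    image: Any    # the original image
    winner: str   # summarized, more holistic text
    loser: str    # the original text


def build_triplet(
    image: Any,
    original_text: str,
    augmentations: Sequence[Callable[[Any], Any]],  # hypothetical visual augmentations
    generate: Callable[[Any], str],                 # hypothetical LVLM text generation
    summarize: Callable[[List[str]], str],          # hypothetical summarization step
) -> PreferenceTriplet:
    # Assumption: one candidate text is generated per augmented view.
    candidates = [generate(augment(image)) for augment in augmentations]
    # The summary over all views serves as the winner response.
    winner = summarize(candidates)
    # The original text is designated as the loser response.
    return PreferenceTriplet(image=image, winner=winner, loser=original_text)


# Toy stand-ins so the sketch runs without a model or image library.
toy_augmentations = [lambda img: img + "|flip", lambda img: img + "|crop"]
toy_generate = lambda view: f"caption({view})"
toy_summarize = lambda texts: " ; ".join(texts)

triplet = build_triplet("img", "orig", toy_augmentations, toy_generate, toy_summarize)
```

Passing the model and the summarizer as callables keeps the sketch independent of any particular LVLM; in practice each would wrap the model being aligned.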

Paper Structure

This paper contains 12 sections, 6 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comprehensive evaluation of SHAPE's enhanced performance against SOTA models across multimodal benchmarks: achieving +11.3% on MMVet, +1.4% on MMBench, and +8.0% on POPE over baselines in 7B models.
  • Figure 2: Attention visualization for the winner answer $y_{win}$ shows how SHAPE enables holistic caption generation; $y_{lose}$ is the original generation. Green text marks correctly recognized content, while red text marks incorrect recognition.
  • Figure 3: Overview of SHAPE: Unlike traditional SFT, which relies on single-path supervised learning with human preference intervention (e.g., (a) SFT), or simple augmentation methods (e.g., (b) SeVa), SHAPE fully leverages the potential of LVLMs by extending the self-supervised optimization paradigm to Visual Question Answering (VQA). It enriches image-side understanding to generate holistic training signals, enabling more reliable and detailed visual comprehension without requiring additional annotations.
  • Figure 4: Evaluation of SHAPE, SeVa, CSR, and LLaVA-1.5 (7B and 13B) on MMVet, comparing win rates and average output lengths, with GPT-4 as the judge for visual-language task performance.