Continual SFT Matches Multimodal RLHF with Negative Supervision

Ke Zhu, Yu Wang, Yanpeng Sun, Qiang Chen, Jiangjiang Liu, Gang Zhang, Jingdong Wang

TL;DR

nSFT disentangles the negative supervision in the RLHF paradigm and continually aligns VLMs with a simple SFT loss, which is more memory-efficient than multimodal RLHF, where 2 or 4 large VLMs are strictly required.

Abstract

Multimodal RLHF usually follows the supervised finetuning (SFT) stage to continually improve the comprehension of vision-language models (VLMs). Conventional wisdom holds that it is superior to continual SFT during this preference alignment stage. In this paper, we observe that the inherent value of multimodal RLHF lies in its negative supervision, namely the logits of the rejected responses. We thus propose a novel negative supervised finetuning (nSFT) approach that fully exploits this information. nSFT disentangles the negative supervision in the RLHF paradigm and continually aligns VLMs with a simple SFT loss. This is more memory-efficient than multimodal RLHF, where 2 (e.g., DPO) or 4 (e.g., PPO) large VLMs are strictly required. The effectiveness of nSFT is validated by comparing it with various multimodal RLHF approaches across different dataset sources, base VLMs and evaluation metrics. In addition, extensive ablations are provided to support our hypothesis. We hope this paper will stimulate further research on properly aligning large vision-language models.
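
To make the memory argument concrete, the following is a minimal sketch that contrasts a plain SFT loss with the standard DPO loss for a single preference pair. It is not the paper's nSFT implementation: the function names, the toy log-probabilities and the default `beta` are assumptions for illustration only. The point it shows is that the DPO loss needs log-probabilities from two models (the policy and a frozen reference) and that the rejected response enters the loss directly, whereas the SFT loss needs only the policy.

```python
import math


def sft_loss(token_logps):
    """Mean negative log-likelihood of the target tokens under the policy.

    Only one model (the policy being trained) is needed to compute it.
    """
    return -sum(token_logps) / len(token_logps)


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence-level log-probabilities. The `ref_*` values come
    from a second, frozen reference model, which is why DPO keeps two
    models in memory. `beta` is a hypothetical default.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Toy values (assumed): pushing the rejected response's log-probability
# down lowers the DPO loss even if the chosen response is unchanged.
before = dpo_loss(-5.0, -5.0, -5.0, -5.0)
after = dpo_loss(-5.0, -9.0, -5.0, -5.0)
```

In this sketch the term involving `pol_rejected` is the negative supervision the abstract refers to; how nSFT turns that signal into SFT training data is described in the paper's method section, not here.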

Paper Structure

This paper contains 26 sections, 26 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Standard DPO training pipeline (in the first row) and our proposed nSFT (in the second row).
  • Figure 2: The training-time and GPU-memory panels compare LLaVA-1.5 ('Base'), 3 DPO techniques (SeVa, SIMA and CSR) and 1 PPO technique (LLaVA-RLHF). A further panel shows multimodal results of averaged DPO methods, pure continual (Cont.) SFT and our nSFT.
  • Figure 3: Visualization of a standard DPO pipeline and our nSFT method. In DPO, GT annotations usually serve directly as chosen responses [bpo, self-play]. During nSFT, we ask an LLM to first identify the specific erroneous part (red) of the rejected response by referring to the chosen response and the enumerated errors in the vision error codebook (cf. appendix). The LLM then constructs a conversation about the image that helps the model avoid such mistakes (e.g., correct answers are coded in blue). This figure is best viewed in color.
  • Figure 4: Visualization of benchmark results (relative improvement over the baseline LLaVA-1.5-7B) using continual SFT, DPO and our nSFT (cf. the VQA through hallucination panels). Here we choose LLaVA-150k as the data source and randomly sample 2.5k, 5k, 7.5k and 10k examples for model alignment. In the dataset panel, we visualize the averaged results (over the 9 benchmarks shown in the main improvement table) and analyze the effect of dataset choice.
  • Figure 5: Visualizations of responses generated by the LLaVA-v1.5-7B, DPO and nSFT models. In the left part, correct content is emphasized in purple (for DPO) and blue (for nSFT), while erroneous content is highlighted in red. This figure is best viewed in color.
  • ...and 6 more figures