LPOI: Listwise Preference Optimization for Vision Language Models

Fatemeh Pesaran Zadeh, Yoojin Oh, Gunhee Kim

TL;DR

This paper tackles the problem of aligning vision-language models with human preferences while mitigating hallucinations. It introduces LPOI, a listwise preference optimization method that uses object-aware hard negatives and interpolated image lists. The approach identifies a critical object, masks it to create hard negatives, and automatically constructs a ranked list of images in which the object is visible to progressively different degrees. It then optimizes a listwise loss that respects the entire ranking, alongside the existing DPO and anchor losses. Empirical evaluation on MMHalBench, Object HalBench, and AMBER across multiple base models shows that LPOI reduces hallucinations more effectively than DPO, mDPO, and related baselines, and human evaluation corroborates the improved factual grounding. The method achieves these gains without requiring annotations beyond standard pairwise preferences, and it benefits from visual prompting and larger list sizes.

Abstract

Aligning large VLMs with human preferences is a challenging task, as methods like RLHF and DPO often overfit to textual information or exacerbate hallucinations. Although augmenting negative image samples partially addresses these pitfalls, no prior work has employed listwise preference optimization for VLMs, due to the complexity and cost of constructing listwise image samples. In this work, we propose LPOI, the first object-aware listwise preference optimization developed for reducing hallucinations in VLMs. LPOI identifies and masks a critical object in the image, and then interpolates the masked region between the positive and negative images to form a sequence of incrementally more complete images. The model is trained to rank these images in ascending order of object visibility, effectively reducing hallucinations while retaining visual fidelity. LPOI requires no extra annotations beyond standard pairwise preference data, as it automatically constructs the ranked lists through object masking and interpolation. Comprehensive experiments on MMHalBench, AMBER, and Object HalBench confirm that LPOI outperforms existing preference optimization methods in reducing hallucinations and enhancing VLM performance. We make the code available at https://github.com/fatemehpesaran310/lpoi.
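
The list construction and ranking objective described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the helper names `build_masked_list` and `plackett_luce_loss`, the rectangular mask that grows linearly from the center of a bounding box, the fill value, and the Plackett-Luce form of the listwise loss are all assumptions made for this example. The `scores` argument stands in for whatever per-image preference score the model assigns to the chosen response; how that score is computed is not specified here.

```python
import math


def build_masked_list(image, box, list_size=4, fill=0.0):
    """Return `list_size` copies of `image` with a growing masked rectangle.

    image: 2D list of floats (a toy grayscale image).
    box: (top, left, bottom, right) bounding box of the critical object
         (hypothetical; in practice it would come from an object detector).
    Index 0 is the unmasked positive image; the last index has the whole
    box masked. Intermediate images mask a centered fraction of the box.
    """
    if list_size < 2:
        raise ValueError("list_size must be at least 2")
    top, left, bottom, right = box
    cy, cx = (top + bottom) / 2, (left + right) / 2
    half_h, half_w = (bottom - top) / 2, (right - left) / 2
    images = []
    for k in range(list_size):
        frac = k / (list_size - 1)  # 0.0 = nothing masked, 1.0 = full box
        r0, r1 = round(cy - frac * half_h), round(cy + frac * half_h)
        c0, c1 = round(cx - frac * half_w), round(cx + frac * half_w)
        masked = [row[:] for row in image]  # copy so the input is untouched
        for r in range(r0, r1):
            for c in range(c0, c1):
                masked[r][c] = fill
        images.append(masked)
    return images


def plackett_luce_loss(scores):
    """Negative log-likelihood of the ranking scores[0] > scores[1] > ...

    Uses a Plackett-Luce model, which is an assumed choice of listwise
    loss for this sketch. Lower is better; the loss shrinks as the scores
    agree more strongly with the target order.
    """
    loss = 0.0
    for i in range(len(scores) - 1):
        tail = scores[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_z - scores[i]
    return loss


# Toy usage: a 4x4 image of ones whose "object" covers the whole image.
toy_image = [[1.0] * 4 for _ in range(4)]
ranked = build_masked_list(toy_image, box=(0, 0, 4, 4), list_size=3)
visible = [sum(sum(row) for row in img) for img in ranked]
```

In this toy setup the amount of visible content falls monotonically along the list, and the loss is smaller when the scores follow the same order (most visible image scored highest) than when they are reversed.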

Paper Structure

This paper contains 32 sections, 3 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Comparison of preference optimization (PO) strategies for VLMs, with text and image negatives shown on gray and orange backgrounds, respectively. (a) DPO (Rafailov et al., 2024): PO with text negatives. (b) mDPO (Wang et al., 2024): DPO + PO using randomly cropped images as binary image negatives. (c) The proposed LPOI method: DPO + listwise PO with ranked image negatives, consisting of four samples: (1) the full image, (2) an image with part of the outfit, (3) an image with no outfit but some parts of the person, and (4) an image with neither the outfit nor the person.
  • Figure 2: Overview of the LPOI framework. (1) Given an input image, a prompt, and the corresponding chosen and rejected responses, we first compute $L_{DPO}$ and $L_{Anchor}$ from the response pair, similarly to traditional DPO. (2) An object detection model and a VLM are employed to identify the most important object in the image. This object is progressively masked across a sequence of images, with more visual cues masked as the image deviates further from the positive example. (3) We optimize the model on this sequence of progressively masked images, which helps it differentiate between varying levels of hallucination, discern subtle changes in visual context, and generate responses that are more accurately grounded in the image.
  • Figure 3: Human evaluation results on a subset of the AMBER and Object HalBench benchmarks. We compare responses generated by the Idefics-2B model fine-tuned using LPOI (ours), DPO, and mDPO.
  • Figure 4: Comparison of saliency maps with and without visual prompting (highlighted with a red circle). Visual prompting shifts the model’s attention towards the masked area, guiding it to focus more on the region of interest. In the saliency maps, blue indicates low saliency, while red indicates high saliency.
  • Figure 5: MMHalBench results for different preference optimization methods trained on three different sizes of training sets.
  • ...and 4 more figures