POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation

Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, Jun Liu

TL;DR

POPEN tackles hallucinations and imprecise segmentation in LVLM-based reasoning segmentation by aligning outputs with human preferences through task-specific preference optimization and a preference-based ensemble. It introduces two types of preference data, text semantics preference $P_t$ and segmentation embedding preference $P_s$, and two corresponding losses $L_t$ and $L_s$, combined as $L_{\text{pre}} = L_t + L_s$, to jointly improve text alignment and segmentation accuracy. The approach builds on PixelLM with an LVLM backbone and a CLIP-based vision encoder, and is evaluated on benchmarks such as MUSE, RefCOCO, GranD$_f$, and ReasonSeg, achieving state-of-the-art $gIoU$ and $cIoU$ while reducing hallucinations. The results point to more reliable segmentation under complex natural-language instructions, at the cost of a moderate inference-time overhead from the ensemble.
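The two metrics named above can be made concrete with a short sketch. It assumes the definitions commonly used in reasoning segmentation benchmarks: $gIoU$ is the mean of per-image IoU values, and $cIoU$ is the cumulative intersection divided by the cumulative union over the whole dataset. The function names and the convention of scoring an empty union as 1 are assumptions made for this illustration, not details taken from the paper.

```python
import numpy as np


def iou_stats(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = int(np.logical_and(pred, gt).sum())
    union = int(np.logical_or(pred, gt).sum())
    return inter, union


def g_iou(preds, gts):
    """Mean of per-image IoU. An empty union is scored as 1.0 (assumption)."""
    scores = []
    for pred, gt in zip(preds, gts):
        inter, union = iou_stats(pred, gt)
        scores.append(1.0 if union == 0 else inter / union)
    return float(np.mean(scores))


def c_iou(preds, gts):
    """Cumulative intersection over cumulative union across all images."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = iou_stats(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# Two toy 2x2 masks: the first prediction over-segments by one pixel,
# the second matches its ground truth exactly.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
print(g_iou(preds, gts), c_iou(preds, gts))
```

Because $cIoU$ pools pixels before dividing, it is dominated by large objects, whereas $gIoU$ weights every image equally; this is why the two numbers are usually reported together.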

Abstract

Existing LVLM-based reasoning segmentation methods often suffer from imprecise segmentation results and hallucinations in their text responses. This paper introduces POPEN, a novel framework designed to address these issues and achieve improved results. POPEN includes a preference-based optimization method to finetune the LVLM, aligning it more closely with human preferences and thereby generating better text responses and segmentation results. Additionally, POPEN introduces a preference-based ensemble method for inference, which integrates multiple outputs from the LVLM using a preference-score-based attention mechanism for refinement. To better adapt to the segmentation task, we incorporate several task-specific designs in our POPEN framework, including a new approach for collecting segmentation preference data with a curriculum learning mechanism, and a novel preference optimization loss to refine the segmentation capability of the LVLM. Experiments demonstrate that our method achieves state-of-the-art performance in reasoning segmentation, exhibiting minimal hallucination in text responses and the highest segmentation accuracy compared to previous advanced methods like LISA and PixelLM. Project page is https://lanyunzhu.site/POPEN/
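To give a rough intuition for what a preference-score-based attention over multiple outputs could look like, the sketch below fuses $K$ candidate segmentation embeddings per target with softmax weights derived from their preference scores. This is a generic illustration only: the abstract does not specify the mechanism, so the function name `preference_weighted_ensemble`, the `temperature` parameter, the tensor shapes, and the use of a plain softmax-weighted sum are all assumptions rather than the authors' design.

```python
import numpy as np


def preference_weighted_ensemble(embeddings, scores, temperature=1.0):
    """Fuse K candidate embeddings per target using softmax(score) weights.

    embeddings: array of shape (K, N, D), K candidates for N targets.
    scores:     array of shape (K, N), one preference score per candidate.
    Returns an array of shape (N, D).
    (Hypothetical helper; shapes and weighting scheme are assumptions.)
    """
    embeddings = np.asarray(embeddings, dtype=float)
    scores = np.asarray(scores, dtype=float) / temperature
    # Subtract the per-target maximum for numerical stability.
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over K
    # Weighted sum over the K candidates.
    return (weights[..., None] * embeddings).sum(axis=0)


# K = 2 candidate responses, N = 1 target, D = 2 embedding dimensions.
candidates = [[[1.0, 0.0]], [[0.0, 1.0]]]
fused = preference_weighted_ensemble(candidates, scores=[[2.0], [0.0]])
print(fused.shape)
```

With equal scores the candidates are averaged; as one score grows relative to the others, the fused embedding approaches that candidate, so a low `temperature` behaves like picking the single preferred output.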

Paper Structure

This paper contains 24 sections, 10 equations, 7 figures, 12 tables, 1 algorithm.

Figures (7)

  • Figure 1: An example of hallucination in text responses and inaccurate segmentation results in existing LVLM-based reasoning segmentation methods. In this example, the LVLM generates the non-existent apple in the text response. The segmentation results show rough edges (grapes) or incorrect localization (misidentifying part of the area belonging to the cup as an orange).
  • Figure 2: Illustration of (a) preference data collection and (b) preference optimization method in our POPEN framework.
  • Figure 3: Illustration of the preference-based ensemble. For simplicity of illustration, in this figure the number $K$ of generated responses is 2 and the number $N$ of segmentation targets is 1.
  • Figure 4: Comparative examples of text responses and segmentation results between PixelLM and our POPEN.
  • Figure 5: Performance on the grounded conversation generation (GCG) task of GranD$_f$ Dataset. Metrics include METEOR (M), CIDEr (C), AP50, mIoU, and Mask Recall. Our POPEN achieves the best performance.
  • ...and 2 more figures