Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy

Yinan Sun, Xiongkuo Min, Zicheng Zhang, Yixuan Gao, Yuqin Cao, Guangtao Zhai

TL;DR

This work tackles hallucinations in Low-level Visual Perception and Understanding (HLPU) by emphasizing model self-awareness. It introduces the HLPU instruction database (~200K samples) and the LLSAVisionQA test benchmark, which quantifies self-awareness in low-level tasks. The SAFEQA architecture augments standard vision-language models with image, salient-region, and quality features, while the ESA-PO framework uses Plackett-Luce-based preference optimization with an "I don't know" option to calibrate knowledge boundaries. In the reported experiments, SAFEQA with ESA-PO improves both accuracy and self-awareness, surpassing several closed-source baselines in low-level vision QA and judging responses more reliably.

Abstract

The rapid development of multimodal large language models has resulted in remarkable advancements in visual perception and understanding, consolidating several tasks into a single visual question-answering framework. However, these models are prone to hallucinations, which limit their reliability as artificial intelligence systems. While this issue has been extensively researched in natural language processing and image captioning, hallucinations in Low-level Visual Perception and Understanding (HLPU) remain largely uninvestigated, especially in the context of image quality assessment tasks. We argue that these hallucinations arise from an absence of clear self-awareness within the models. To address this issue, we first introduce the HLPU instruction database, the first instruction database specifically focused on hallucinations in low-level vision tasks. This database contains approximately 200K question-answer pairs and comprises four subsets, each covering different types of instructions. Subsequently, we propose the Self-Awareness Failure Elimination (SAFEQA) model, which utilizes image features, salient region features and quality features to improve the perception and comprehension abilities of the model in low-level vision tasks. Furthermore, we propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to increase the model's awareness of knowledge boundaries, thereby mitigating the incidence of hallucination. Finally, we conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances the self-awareness of the model in these tasks and reduces hallucinations. Notably, our proposed method improves both the accuracy and the self-awareness of the model and outperforms closed-source models on various evaluation metrics.

Paper Structure

This paper contains 26 sections, 14 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The self-awareness of models in low-level vision tasks. A reliable model should be able to accurately recognize what it knows and what it does not know. It should provide correct answers to questions within its basic knowledge, while declining to answer questions that fall outside of its basic knowledge.
  • Figure 2: HLPU instruction database construction pipeline. First, we select 76K available samples as the original data; the subjective experiment process is shown in (a) Preparation of database. Second, we use GPT to convert the database into 200K instruction-response pairs, providing answers that are as detailed as possible for each question, as shown in (b) Visual question answering (VQA). Finally, we use GPT to generate multimodal preference data, which is used in the low-level visual preference optimization training strategy, as shown in (c) Multimodal preference data.
  • Figure 3: The composition of the HLPU instruction database, in which the 200K instruction-response pairs include (a) Pathway reasoning and extended conversations, (b) "What" questions, (c) "Yes-or-No" questions and (d) "How" questions. The blue part of a response indicates refusal and the red part indicates hallucination.
  • Figure 4: The detailed architecture of the proposed model, which is composed of five modules. (a) The image feature extraction module extracts image features using SigLIP. (b) The salient region feature extraction module extracts salient features using Swin Transformer. (c) The quality feature extraction module extracts quality features using Swin Transformer. (d) The text tokenizer. (e) The LLM decoding module integrates features and outputs responses.
  • Figure 5: The detailed structure of the proposed training strategy. The traditional DPO method uses only positive and negative responses, but in real-life situations the response to an unknown question is often "I don't know". Our proposed method incorporates "I don't know" as a suboptimal response to enhance the self-awareness of the models.
  • ...and 1 more figure
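
The ranking idea in the Figure 5 caption (a correct answer preferred over "I don't know", which is in turn preferred over a hallucinated answer) can be illustrated with a generic Plackett-Luce ranking loss. The sketch below is not the paper's ESA-PO objective, whose exact form is not given in this summary; it assumes DPO-style implicit rewards, and the function names, the `beta` value, and the log-probabilities in the usage lines are all hypothetical.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward (an assumption, not taken from the paper):
    beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def plackett_luce_nll(rewards_ranked):
    """Negative log-likelihood of one full ranking under Plackett-Luce.

    rewards_ranked: rewards ordered from most to least preferred.
    P(ranking) = prod_k exp(r_k) / sum_{j >= k} exp(r_j)
    """
    nll = 0.0
    # The last position always contributes log(1) = 0, so it is skipped.
    for k in range(len(rewards_ranked) - 1):
        tail = rewards_ranked[k:]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(r - m) for r in tail))
        nll -= rewards_ranked[k] - log_denom
    return nll


# Hypothetical sequence log-probabilities (policy, reference) for one question,
# ordered as: correct answer, "I don't know", hallucinated answer.
pairs = [(-4.0, -6.0), (-5.0, -5.5), (-7.0, -5.0)]
rewards = [implicit_reward(p, r) for p, r in pairs]
loss = plackett_luce_nll(rewards)
```

With only two responses the loss reduces to the Bradley-Terry form used by standard DPO, so the three-way ranking can be read as DPO extended with a middle-ranked refusal.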