Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang

TL;DR

This work proposes LACING, a systemic framework that addresses the language bias of LVLMs by combining a muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (SIG). MDA is a parallel dual-attention mechanism that enhances the integration of visual inputs across the model.

Abstract

Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. Despite this promising performance, LVLMs suffer from hallucinations caused by language bias, which leads to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: (1) the different scales of training data between the LLM pretraining stage and the multimodal alignment stage, and (2) the learned inference bias due to the short-term dependency of text data. We therefore propose LACING, a systemic framework designed to address the language bias of LVLMs with a muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (SIG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. SIG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. SIG further proposes a novel decoding strategy that uses the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at [lacing-lvlm.github.io](https://lacing-lvlm.github.io).
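
The abstract does not spell out the decoding rule, so the following is only a minimal sketch of one plausible form of guidance decoding, assuming it resembles classifier-free guidance: the model is run once with the real image and once with the soft visual prompt in its place, and the next-token distribution is extrapolated away from the text-only prediction. The function names, the scaling parameter `lam`, and the exact combination formula are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - np.max(x, axis=-1, keepdims=True)
    return x - np.log(np.sum(np.exp(x), axis=-1, keepdims=True))


def guided_logits(logits_image, logits_soft, lam):
    """Combine two next-token predictions (assumed, CFG-style rule).

    logits_image: logits when the model sees the real image.
    logits_soft:  logits when the image is replaced by the soft visual prompt,
                  i.e. a prediction driven mostly by text.
    lam:          hypothetical scaling parameter; lam = 1 recovers the
                  image-conditioned prediction, lam > 1 pushes further away
                  from the text-driven prediction.
    """
    lp_img = log_softmax(np.asarray(logits_image, dtype=float))
    lp_soft = log_softmax(np.asarray(logits_soft, dtype=float))
    return lp_soft + lam * (lp_img - lp_soft)


def next_token(logits_image, logits_soft, lam=1.5):
    """Greedy choice under the guided scores."""
    return int(np.argmax(guided_logits(logits_image, logits_soft, lam)))
```

Under this assumed rule, a token that the text-only pass strongly favors but the image only weakly supports loses score as `lam` grows, which is the intuition behind using the soft prompt to counter over-reliance on adjacent text.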

Paper Structure

This paper contains 38 sections, 10 equations, 14 figures, 8 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of LACING, a systemic framework consisting of Multimodal Dual-Attention (MDA, bottom) and Soft-Image Guidance (SIG, top) to mitigate the language bias of LVLMs. MDA is a parallel dual-attention mechanism that constructs separate attention for visual and text inputs during both training and inference. SIG introduces a learnable soft visual prompt during training to replace visual inputs. This soft prompt preserves input patterns while compelling the model to prioritize text inputs during inference.
  • Figure 2: Average attention scores of output tokens towards text and visual tokens across different layers in LLaVA-1.5, showing that only the first two layers apply considerable attention to visual tokens. In contrast, deeper layers largely neglect them.
  • Figure 3: Comparison of attention allocation between a standard LVLM (LLaVA-1.5) and our model trained with the Multimodal Dual-Attention (MDA) mechanism. Text tokens and visual tokens are indicated in blue and purple, respectively, in the sidebar.
  • Figure 4: Attention allocation of LVLMs to visual and text tokens. Attention to visual tokens (a) decreases as response generation progresses, while attention to text tokens (b) increases.
  • Figure 5: Model performance on LLaVABench across various values of the scaling parameter $\lambda$.
  • ...and 9 more figures
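
The per-layer attention statistics described in the captions for Figures 2 and 4 can be approximated with a short sketch like the one below. It assumes attention weights are available as an array of shape (layers, heads, queries, keys) whose rows sum to 1, plus a boolean mask marking which key positions are visual tokens. The function name and array layout are hypothetical, and the paper's exact averaging procedure may differ.

```python
import numpy as np


def attention_share(attn, visual_mask):
    """Average attention mass on visual vs. text tokens, per layer.

    attn:        array (layers, heads, queries, keys); each row over keys
                 is assumed to be a softmax output that sums to 1.
    visual_mask: boolean array (keys,), True where the key is a visual token.
    Returns two arrays of shape (layers,): visual share and text share.
    """
    attn = np.asarray(attn, dtype=float)
    visual_mask = np.asarray(visual_mask, dtype=bool)
    # Sum the attention each query places on each group of keys.
    visual = attn[..., visual_mask].sum(axis=-1)
    text = attn[..., ~visual_mask].sum(axis=-1)
    # Average over heads and query (output-token) positions.
    return visual.mean(axis=(1, 2)), text.mean(axis=(1, 2))
```

Restricting the query axis to a sliding window of generated tokens, instead of averaging over all of them, would give the kind of trend over generation steps that Figure 4 describes.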