Mitigating Object Hallucination via Concentric Causal Attention

Yun Xing, Yiheng Li, Ivan Laptev, Shijian Lu

TL;DR

This work proposes Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy that mitigates the impact of RoPE long-term decay in LVLMs by naturally reducing the relative distance between visual and instruction tokens.

Abstract

Recent Large Vision Language Models (LVLMs) show remarkable zero-shot conversational and reasoning capabilities given multimodal queries. Nevertheless, they suffer from object hallucination, a phenomenon in which LVLMs generate textual responses that are not factually aligned with the image inputs. Our pilot study reveals that object hallucination is closely tied to Rotary Position Encoding (RoPE), a widely adopted positional dependency modeling design in existing LVLMs. Due to the long-term decay in RoPE, LVLMs tend to hallucinate more when relevant visual cues are distant from instruction tokens in the multimodal input sequence. Additionally, we observe a similar effect when reversing the sequential order of visual tokens during multimodal alignment. Our tests indicate that long-term decay in RoPE makes it difficult for LVLMs to capture visual-instruction interactions across long distances. We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy that mitigates the impact of RoPE long-term decay in LVLMs by naturally reducing the relative distance between visual and instruction tokens. With CCA, visual tokens can better interact with instruction tokens, thereby enhancing the model's perception capability and alleviating object hallucination. Without bells and whistles, our positional alignment method surpasses existing hallucination mitigation strategies by large margins on multiple object hallucination benchmarks.
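The long-term decay mentioned in the abstract can be illustrated numerically. The sketch below is not the authors' code: it assumes the standard RoPE formulation with base 10000 and adjacent-pair rotation, uses all-ones query and key vectors as a toy input, and introduces hypothetical helper names (`rope_angles`, `apply_rope`, `rope_score`, `mean_score`). It compares the average rotated dot product at short and long relative distances.

```python
import math

def rope_angles(dim, base=10000.0):
    """Per-pair rotation frequencies theta_i = base^(-2i/dim) (assumed standard RoPE)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def apply_rope(x, pos, base=10000.0):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by the angle pos * theta_i."""
    out = []
    for i, theta in enumerate(rope_angles(len(x), base)):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out.extend([a * c - b * s, a * s + b * c])
    return out

def rope_score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and a rotated key (one attention logit, unscaled)."""
    qr, kr = apply_rope(q, q_pos), apply_rope(k, k_pos)
    return sum(a * b for a, b in zip(qr, kr))

def mean_score(dim, distances):
    """Average score of all-ones vectors over a window of relative distances."""
    ones = [1.0] * dim
    scores = [rope_score(ones, ones, d, 0) for d in distances]
    return sum(scores) / len(scores)

DIM = 128  # toy feature dimension, chosen for illustration only
near = mean_score(DIM, range(1, 33))     # query 1 to 32 positions after the key
far = mean_score(DIM, range(481, 513))   # query 481 to 512 positions after the key
print(f"mean score at short distances: {near:.2f}")
print(f"mean score at long distances:  {far:.2f}")
```

Individual scores oscillate with distance, so the sketch averages over a window of distances to expose the downward trend. Because the score depends only on the relative offset between the two positions, shrinking that offset (as CCA aims to do for visual and instruction tokens) directly counteracts the decay.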

Paper Structure

This paper contains 20 sections, 4 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Long-term decay of RoPE (Su et al., 2024) in Large Vision Language Models (LVLMs). (a) A schematic view of inference in LVLMs, which typically involves a pre-trained vision encoder, a large language model, and a projector that maps visual tokens to the textual space. For each of the $V$ visual tokens $\mathbb{S}_{vision}$, we aggregate its information flow to the instruction tokens $\mathbb{S}_{instruct}$ and reshape the aggregated results to 2-D ($\sqrt{V}$ by $\sqrt{V}$). Applying RoPE to visual tokens introduces long-term decay, as illustrated in (c): the information flowing from visual tokens to instruction tokens gradually decays from the lower-right region (rightmost visual tokens in the 1-D sequence) to the upper-left region (leftmost visual tokens). Instruction tokens have much less direct interaction with the leftmost visual tokens than with the rightmost ones, leading to inferior multimodal alignment in the trained LVLMs. (b) and (c) are derived from the adversarial subset of the $3k$ POPE (Li et al., 2023) image-instruction pairs. Best viewed in color.
  • Figure 2: Motivation Experiment. Given an image $I_v$ with object $O_v$, we crop $O_v$ and paste it at various spatial positions $\{v_1,...,v_k\}$ within a pre-defined template. For every pasting position, we ask two LVLMs ($\mathcal{F}_b$ and $\mathcal{F}_r$) whether object $O_v$ is in the template, where $\mathcal{F}_b$ is a baseline model that follows the raster-scan positional alignment strategy and $\mathcal{F}_r$ is a model that uses a reverse raster-scan positional alignment strategy. The total number of correct responses at each pasting position $\{v_1,...,v_k\}$ is reported in (a) for $\mathcal{F}_b$ and in (b) for $\mathcal{F}_r$. We observe that $\mathcal{F}_b$ is more likely to generate correct responses when object $O_v$ is pasted in the lower region, while $\mathcal{F}_r$ hallucinates less when $O_v$ is pasted in the upper region. Pasting positions with the most and the fewest correct responses are highlighted with solid-line and dotted-line red boxes, respectively. More details are provided in the appendix. Best viewed in color.
  • Figure 3: An overview of Concentric Causal Attention. Left: Visual Token Re-organization. In contrast to the raster-scan positional alignment in (a), we design the concentric positional alignment in (b), which shortens the visual-instruction distance and retains spatial locality for 2-D data such as images. Right: Concentric Causal Masking. By default, as in (c), a visual token attends to all preceding visual tokens in a 1-D sequence. In contrast, our concentric causal attention in (d) models 2-D continuous positional dependencies among visual tokens, where center visual tokens attend to peripheral ones. Causal masks are $V$ by $V$, where $V$ is $36$ in this case for demonstration purposes. Best viewed in color.
  • Figure 4: RoPE in LLaMA. A schematic view of LLaMA with RoPE highlighted, and an example illustration of how RoPE is applied to query or key features. We use a short input sequence of length 4 with feature dimension 4 for demonstration purposes. Input tokens are rotated by angles that depend on their token positions. For the mathematical definition, please refer to the motivation section.
  • Figure 5: Workflow for synthesizing the testing data. Given an image and a box annotation for one object instance, we crop the object and paste it onto a template image initialized with ImageNet mean pixel values. We paste every cropped region at every spatial position. The resulting data constitute a large number of questions about object existence that are diverse in spatial position.
  • ...and 5 more figures
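
The concentric re-organization and masking described in the Figure 3 caption can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes that all visual tokens in the same concentric ring share one position index (outermost ring first, center last) and that a token may attend to tokens in its own ring or in any more peripheral ring. The function names (`ring_index`, `concentric_positions`, `concentric_causal_mask`) are hypothetical.

```python
def ring_index(row, col, n):
    """Ring of a cell in an n-by-n grid: 0 for the outermost ring, growing toward the center."""
    return min(row, col, n - 1 - row, n - 1 - col)

def concentric_positions(n):
    """Position index for each visual token, listed in raster order.

    Assumption: tokens in the same ring share a single position index.
    """
    return [ring_index(r, c, n) for r in range(n) for c in range(n)]

def concentric_causal_mask(n):
    """V-by-V boolean mask (V = n * n); mask[i][j] is True if token i may attend to token j.

    Assumption: a token attends to tokens in its own ring and in more peripheral rings,
    so center tokens attend to peripheral ones but not the other way around.
    """
    pos = concentric_positions(n)
    return [[pos[j] <= pos[i] for j in range(len(pos))] for i in range(len(pos))]

N = 6  # 6-by-6 grid, i.e. V = 36 as in the Figure 3 demonstration
positions = concentric_positions(N)
mask = concentric_causal_mask(N)

# Print the grid of position indices, one grid row per line.
for r in range(N):
    print(" ".join(str(p) for p in positions[r * N:(r + 1) * N]))
```

Under these assumptions the largest visual position index is `N // 2 - 1`, compared with `N * N - 1` for raster-scan ordering, which is the sense in which the visual-instruction distance shrinks.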