Hijacking Context in Large Multi-modal Models

Joonhyun Jeong

TL;DR

This work identifies context hijacking in large multi-modal models, where a minority of incoherent image-text pairs can divert responses away from the originally intended context. It proposes a pre-filtering approach that uses GPT-4V to remove irrelevant contexts, and it investigates replacing hijacked contexts with correlated ones via large foundation models and diffusion-based image generation. The study provides qualitative evidence that GPT-4V is robust to distribution shifts, but that reforming hijacked contexts alone does not fully restore coherence, which motivates further research into more reliable filtering and context-reformation techniques. The findings have practical implications for the reliability of LMMs in real-world use, where noisy or conflicting prompts are common.

Abstract

Recently, Large Multi-modal Models (LMMs) have demonstrated the ability to understand the visual content of images when given instructions about them. Built upon Large Language Models (LLMs), LMMs also inherit their abilities and characteristics, such as in-context learning, where a coherent sequence of images and texts is given as the input prompt. However, we identify a new limitation of off-the-shelf LMMs: a small fraction of incoherent images or text descriptions misleads LMMs into generating biased output about only the hijacked context, not the originally intended context. To address this, we propose a pre-filtering method that removes irrelevant contexts via GPT-4V, based on its robustness to distribution shift within the contexts. We further investigate whether replacing the hijacked visual and textual contexts with correlated ones via GPT-4V and text-to-image models can help yield coherent responses.
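
As a rough illustration of the pre-filtering idea, the following minimal sketch drops the context pairs that a judge flags as incoherent before the prompt reaches the LMM. It is not the authors' implementation: the names `ContextPair`, `prefilter`, and `keyword_judge` are hypothetical, and the judge here is a toy word-overlap heuristic standing in for the role that GPT-4V plays in the paper, so that the example runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ContextPair:
    """One in-context example: an image reference plus its text description."""
    image: str    # hypothetical: a path or URL identifying the image
    caption: str


# Hypothetical judge signature: given all pairs, return the indices judged
# incoherent with the rest. The paper assigns this role to GPT-4V; a plain
# callable is assumed here so any judge can be plugged in.
Judge = Callable[[Sequence[ContextPair]], List[int]]


def prefilter(context: Sequence[ContextPair], judge: Judge) -> List[ContextPair]:
    """Drop the pairs the judge flags, preserving the order of the rest."""
    flagged = set(judge(context))
    return [pair for i, pair in enumerate(context) if i not in flagged]


def keyword_judge(context: Sequence[ContextPair]) -> List[int]:
    """Toy stand-in judge: flag captions sharing no word with any other caption."""
    words = [set(pair.caption.lower().split()) for pair in context]
    flagged = []
    for i, own in enumerate(words):
        # Union of the words used by every other caption in the context.
        others = set().union(*(words[:i] + words[i + 1:]))
        if not (own & others):
            flagged.append(i)
    return flagged


story = [
    ContextPair("plot1.jpg", "firefighters arrive at the burning house"),
    ContextPair("plot2.jpg", "firefighters spray water on the house"),
    ContextPair("hijack.jpg", "players kick a football"),
]
filtered = prefilter(story, keyword_judge)
print([pair.image for pair in filtered])
```

Keeping the judge as a separate callable means the filtering logic stays the same whether the relevance decision comes from a heuristic, as here, or from a multi-modal model.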

Paper Structure (16 sections, 3 equations, 4 figures)

This paper contains 16 sections, 3 equations, 4 figures.

Figures (4)

  • Figure 1: A hijacking context misleads LMMs into generating responses only about the incoherent content. (top): Given the visual and textual story plots 1 to 4, the LMM outputs a coherent caption for the final image. (bottom): When a single incoherent image-caption pair (highlighted in red) is appended to the context, the LMM describes only the hijacked context (i.e., a football game), disregarding the preceding context of visual story plots (i.e., an emergency situation).
  • Figure 2: Effect of the location of the hijacking context. We vary where the hijacking context is inserted among sequences 1 to 5 and visualize the LMM's response to the query image.
  • Figure 3: Effect of reformed context on the LMM response. We replaced the hijacked context with the one reformed by GPT-4 and DALLE-3.
  • Figure 4: Reforming hijacked text and images via GPT-4 and DALLE-3. Given multiple image-text pairs of a visual story plot from the VIST dataset, GPT-4 is instructed to replace any irrelevant images or text descriptions with coherent ones. Subsequently, DALLE-3 is instructed to generate an image corresponding to the newly reformed text, taking the underlying context into account.
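
The two-stage reforming flow described in the Figure 4 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: `reform_context`, `rewrite_captions`, and `generate_image` are hypothetical names, and the two callables are stubs standing in for GPT-4 and DALLE-3 respectively, so no real API is called.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (image reference, caption)

# Hypothetical stage 1 (GPT-4's role): map the index of each irrelevant pair
# to a replacement caption that fits the rest of the story.
Rewriter = Callable[[Sequence[Pair]], Dict[int, str]]

# Hypothetical stage 2 (DALLE-3's role): produce an image reference for the
# new caption, given the captions of the untouched context.
Generator = Callable[[str, List[str]], str]


def reform_context(
    pairs: Sequence[Pair],
    rewrite_captions: Rewriter,
    generate_image: Generator,
) -> List[Pair]:
    """Replace flagged pairs with a reformed caption and a regenerated image."""
    replacements = rewrite_captions(pairs)
    # Captions of the pairs left untouched serve as context for generation.
    kept_captions = [c for i, (_, c) in enumerate(pairs) if i not in replacements]
    reformed = []
    for i, (image, caption) in enumerate(pairs):
        if i in replacements:
            new_caption = replacements[i]
            reformed.append((generate_image(new_caption, kept_captions), new_caption))
        else:
            reformed.append((image, caption))
    return reformed


# Stub callables so the sketch runs offline; real models would replace them.
def stub_rewriter(pairs: Sequence[Pair]) -> Dict[int, str]:
    return {2: "paramedics help the injured resident"}


def stub_generator(caption: str, context: List[str]) -> str:
    return "generated:" + caption.replace(" ", "_")


story = [
    ("plot1.jpg", "an ambulance arrives at the scene"),
    ("plot2.jpg", "firefighters enter the building"),
    ("hijack.jpg", "players kick a football"),
]
result = reform_context(story, stub_rewriter, stub_generator)
print(result[2])
```

Only the flagged pair is changed; the remaining pairs pass through untouched and in their original order.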