Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering

Dosung Lee, Sangwon Jung, Boyoung Kim, Minyoung Kim, Sungyeon Kim, Junyoung Sung, Paul Hongsuck Seo

TL;DR

This work identifies a bias in existing Multimodal Knowledge-Based Visual Question Answering benchmarks where models rely on visual shortcuts tied to the main document entity. It introduces RETINA, an automated, LLM-guided benchmark that uses related entities to remove shortcuts, and MIMIR, a multimodal retriever that augments documents with multiple related-entity images and entity-aware representations. Across RETINA and traditional benchmarks, MIMIR substantially improves retrieval and answer quality, demonstrating the importance of multi-image context and explicit entity alignment for robust MKB-VQA. The findings highlight the need for more realistic evaluation settings and propose a practical approach to bridge the gap between benchmark performance and real-world multimodal reasoning.

Abstract

Existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks suffer from "visual shortcuts", as the query image typically matches the primary subject entity of the target document. We demonstrate that models can exploit these shortcuts, achieving comparable results using visual cues alone. To address this, we introduce the Relational Entity Text-Image kNowledge Augmented (RETINA) benchmark, automatically constructed using an LLM-driven pipeline and consisting of a 120k training set and a 2k human-curated test set. RETINA contains queries referencing secondary subjects (i.e., related entities) and pairs them with images of these related entities, removing the visual shortcut. When evaluated on RETINA, existing models show significantly degraded performance, confirming their reliance on the shortcut. Furthermore, we propose the Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings with images of multiple related entities and thereby handles RETINA effectively, unlike prior work that uses only a single image per document. Our experiments validate the limitations of existing benchmarks and demonstrate the effectiveness of RETINA and MIMIR. Our project is available at: Project Page.
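
To make the single-image limitation concrete, the toy sketch below scores documents against a query image using hand-made 3-dimensional vectors. Everything in it is an illustrative assumption: the entity vectors, the document names, and the max-over-images scoring rule are hypothetical and are not MIMIR's actual fusion, which the paper describes in terms of cross-attention and projection into a shared embedding space.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def image_scores(query_image: Vector, doc_images: Dict[str, List[Vector]]) -> Dict[str, float]:
    """Score each document by its best-matching image.

    Max-pooling over images is an assumption made for this sketch only.
    Every document must have at least one image.
    """
    return {
        doc_id: max(cosine(query_image, image) for image in images)
        for doc_id, images in doc_images.items()
    }


# Hypothetical "image embeddings" for three entities.
TOWER = (1.0, 0.0, 0.0)
ENGINEER = (0.0, 1.0, 0.0)
BRIDGE = (0.0, 0.0, 1.0)

# The question is answered by doc_tower, but the query image shows the
# engineer, an entity related to the tower rather than the tower itself.
query_image = (0.1, 0.9, 0.0)

# One image per document: only the main entity is represented.
single_image_index = {
    "doc_tower": [TOWER],
    "doc_engineer": [ENGINEER],
    "doc_bridge": [BRIDGE],
}

# Multiple images per document: doc_tower also carries its related entity.
multi_image_index = {
    "doc_tower": [TOWER, ENGINEER],
    "doc_engineer": [ENGINEER],
    "doc_bridge": [BRIDGE],
}

single = image_scores(query_image, single_image_index)
multi = image_scores(query_image, multi_image_index)
```

With one image per document, the ground-truth `doc_tower` scores far below `doc_engineer`, a distractor whose main entity matches the query image. Once the related-entity image is added, `doc_tower` matches the query image as well as the distractor does, so the image alone no longer decides the ranking and the query text has to break the tie.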

Paper Structure

This paper contains 25 sections, 2 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Comparison of Previous MKB-VQA Benchmarks and RETINA. (a) In previous benchmarks, the query image corresponds to the same visual entity (highlighted in red) as (c) the shared GT document, creating a visual shortcut. (b) In RETINA, the query uses a different visual entity (highlighted in green), breaking the visual shortcut.
  • Figure 2: Preliminary Experiment on Existing Benchmarks. Recall@$k$ on benchmarks such as Infoseek [chen2023can] and EVQA [mensink2023encyclopedic], which exhibit visual shortcuts [deng-etal-2025-muka]. The plot compares models fine-tuned with image-only queries (blue) against models fine-tuned with image+text queries (green).
  • Figure 3: RETINA Benchmark Generation Pipeline. The pipeline (a) constructs one-hop neighborhood graphs by extracting named entities (gray box) related to the answer entity (white box) and their relations using an LLM [bai2023qwen]; (b) samples a query entity (green) and a qualifying entity (teal) to form a target subgraph with the answer entity (red); and (c) feeds the target subgraph into an LLM to generate a textual query, and collects a corresponding image from M2KR Images [lin2024preflmr]. The query is then paraphrased to minimize lexical overlap with the document.
  • Figure 4: Overview of MIMIR Document Encoder Architecture. (a) Given a document, related named entities are identified with an LLM [bai2023qwen], and corresponding images are collected from the KB; (b) textual, global image, and patch-level features are extracted, with patch features attending to textual features through cross-attention to yield multimodal features; (c) entity token embeddings are incorporated into the textual features prior to cross-attention for richer contextualization; and (d) the final document embedding jointly integrates textual, global, and multimodal features projected into the same embedding space.
  • Figure 5: Analysis of Visual Shortcuts on RETINA. MIMIR (green) and MuKA (red) curves show recall for distractor (dotted) and GT (solid) documents. Distractors are non-GT documents whose main entity matches the query image’s entity.
  • ...and 11 more figures
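
Several of the captions above report Recall@$k$, and Figure 5 reports it separately for GT and distractor documents. The following is a minimal sketch of how such curves could be computed, assuming Recall@$k$ is taken as the fraction of queries with at least one target document in the top $k$ results (the paper's exact definition may differ). The function name, query IDs, and document IDs are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    targets: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of queries whose top-k ranked documents contain a target document."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not ranked:
        return 0.0
    hits = sum(
        1
        for query_id, docs in ranked.items()
        if targets.get(query_id, set()) & set(docs[:k])
    )
    return hits / len(ranked)


# Hypothetical retrieval output: documents ordered from best to worst per query.
ranked = {
    "q1": ["doc_engineer", "doc_tower", "doc_bridge"],
    "q2": ["doc_painter", "doc_gallery", "doc_museum"],
}

# Ground-truth documents that actually answer each query.
gt_docs = {"q1": {"doc_tower"}, "q2": {"doc_museum"}}

# Distractors: non-GT documents whose main entity matches the query image's entity.
distractor_docs = {"q1": {"doc_engineer"}, "q2": {"doc_painter"}}

gt_curve = [recall_at_k(ranked, gt_docs, k) for k in (1, 2, 3)]
distractor_curve = [recall_at_k(ranked, distractor_docs, k) for k in (1, 2, 3)]
```

In this toy ranking the distractor curve reaches 1.0 at $k=1$ while the GT curve only reaches 1.0 at $k=3$, which is the pattern one would expect from a retriever that leans on the visual shortcut.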