
When Choices Become Priors: Contrastive Decoding for Scientific Figure Multiple-Choice QA

Taeyun Roh, Eun-yeong Jo, Wonjune Jang, Jaewoo Kang

Abstract

Scientific figure multiple-choice question answering (MCQA) requires models to reason over diverse visual evidence, ranging from charts and multipanel figures to microscopy and biomedical images. However, this setting suffers from a distinctive bias: answer choices themselves can act as priors, steering multimodal models toward scientifically plausible options even when the figure supports a different answer. We investigate this failure mode through a simple question: what if decoding explicitly discounts what the model would prefer from text alone, so as to favor figure-grounded evidence? To this end, we propose SCICON, a training-free decoding method that scores each candidate by subtracting a text-only option score from its image-conditioned counterpart. Unlike prior contrastive decoding approaches that mitigate hallucinations by contrasting original inputs with distorted images or perturbed instructions, SCICON directly targets the choice-induced prior encoded in candidate text. Across three scientific figure QA benchmarks and three model backbones, SCICON consistently improves accuracy over standard decoding baselines. These results show that decoding against choice-induced priors is an effective and simple way to improve figure-grounded reasoning in scientific MCQA.

Paper Structure

This paper contains 33 sections, 31 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: An MMSci example where both text-only and multimodal decoding favor a plausible distractor (Option C). SciCon suppresses this text-driven bias and recovers the visually grounded correct answer (Option B).
  • Figure 2: Illustration of SciCon. Given a question and candidate answers, the model produces candidate scores under both multimodal and text-only inputs. SciCon subtracts the text-only score, scaled by $\alpha$, from the multimodal score so that candidates favored mainly by textual prior are suppressed and visually grounded candidates are promoted.
  • Figure 3: A MAC case where greedy decoding selects a text-prior-dominant distractor (Option A). The text-only branch assigns overwhelming probability to A, and the multimodal branch remains biased toward the same incorrect choice. After subtracting the text-only prior, SciCon suppresses A and recovers the visually grounded correct answer (Option B).
  • Figure 4: Scientific figure example
  • Figure 5: Scientific figure example
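The scoring rule illustrated in Figure 2 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the scorer callables `score_with_image` and `score_text_only` are hypothetical stand-ins for a model API that returns a log-probability for a candidate answer under multimodal and text-only inputs, respectively.

```python
def scicon_select(candidates, score_with_image, score_text_only, alpha=1.0):
    """Pick the candidate whose image-conditioned score most exceeds its
    text-only (choice-prior) score, scaled by alpha.

    candidates: list of answer options (e.g., ["A", "B", "C", "D"])
    score_with_image(c): log-probability of option c given question + figure
    score_text_only(c):  log-probability of option c given question text alone
    alpha: strength of the subtracted text-only prior
    """
    best, best_score = None, float("-inf")
    for c in candidates:
        # Contrastive score: discount what the model prefers from text alone.
        contrastive = score_with_image(c) - alpha * score_text_only(c)
        if contrastive > best_score:
            best, best_score = c, contrastive
    return best
```

With `alpha = 0` this reduces to standard image-conditioned selection; increasing `alpha` penalizes options that the text-only branch already favors, which is how a plausible distractor like Option C in Figure 1 gets suppressed.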