Reliable Thinking with Images

Haobin Li, Yutong Yang, Yijie Lin, Xiang Dai, Mouxing Yang, Xi Peng

TL;DR

This paper proposes Reliable Thinking with Images (RTWI), a method that estimates the reliability of visual cues and textual CoT in a unified text-centric manner, then applies robust filtering and voting modules to prevent Noisy Thinking (NT) from contaminating the final answer.

Abstract

Thinking with Images (TWI), a multimodal extension of Chain-of-Thought (CoT) that generates interleaved CoT by incorporating visual cues into the textual reasoning process, has recently emerged as a promising avenue to enhance the reasoning capability of Multi-modal Large Language Models (MLLMs). However, the success of existing TWI methods relies heavily on the assumption that interleaved image-text CoTs are faultless, an assumption that is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to imperfections in the visual cue mining and answer reasoning processes. As the saying goes, "One mistake leads to another": an erroneous interleaved CoT causes errors to accumulate, significantly degrading the performance of MLLMs. To solve the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering and voting modules to prevent NT from contaminating the final answer. Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT.
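
The filter-then-vote idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Trace` fields, the use of the batch mean as an adaptive threshold, and the product of the two stage scores as the voting weight are hypothetical stand-ins for RTWI's actual reliability estimates, thresholds, and weighting principles.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trace:
    """One sampled interleaved trace (hypothetical structure)."""
    answer: str
    mining_reliability: float     # assumed score in [0, 1] for the visual-cue mining stage
    reasoning_reliability: float  # assumed score in [0, 1] for the textual reasoning stage


def filter_traces(traces):
    """Keep traces whose two stage scores both reach the batch mean.

    The batch mean is an assumed stand-in for an adaptive threshold.
    """
    mining_thr = mean(t.mining_reliability for t in traces)
    reasoning_thr = mean(t.reasoning_reliability for t in traces)
    kept = [
        t for t in traces
        if t.mining_reliability >= mining_thr
        and t.reasoning_reliability >= reasoning_thr
    ]
    # Fall back to all traces if the filter would remove everything.
    return kept or list(traces)


def weighted_vote(traces):
    """Sum an assumed weight (product of stage scores) per candidate answer."""
    scores = defaultdict(float)
    for t in traces:
        scores[t.answer] += t.mining_reliability * t.reasoning_reliability
    return max(scores, key=scores.get)


def aggregate(traces):
    """Filter unreliable traces, then vote among the remaining ones."""
    return weighted_vote(filter_traces(traces))


# Three low-reliability traces agree on "B"; one high-reliability trace says "A".
traces = [
    Trace("A", 0.90, 0.80),
    Trace("B", 0.40, 0.50),
    Trace("B", 0.50, 0.30),
    Trace("B", 0.45, 0.40),
]
final_answer = aggregate(traces)
```

In this toy batch a plain majority vote would pick "B", while the filtered, reliability-weighted vote picks "A", which is the kind of contamination the filtering and voting modules are meant to prevent.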

Paper Structure (34 sections, 15 equations, 19 figures, 13 tables)

Figures (19)

  • Figure 1: (a) Noisy Thinking: the TWI paradigm inevitably suffers from the noisy thinking problem at both the mining and reasoning stages. On the one hand, MLLMs extract task-agnostic or inaccurate visual cues when the textual CoT generates unreliable tool-calling instructions. Clearly, it is difficult to derive the right answer from incorrect visual cues, e.g., the question is "Find the restaurant closest to me.", but the visual cues either do not depict restaurants or correspond to a restaurant that is not the closest. On the other hand, even when the desired information exists in the acquired visual cues, the MLLM might generate an erroneous textual CoT and hence an incorrect answer, e.g., failing to identify the nearest restaurant due to limited distance-estimation ability. (b) Noise Statistics: we investigate the TWI cases in the Vstar and HR4k datasets and observe that NT in either mining or reasoning leads to wrong answers in real-world scenarios. (c) Observations: we estimate the reliability of the two stages in the TWI process using tool-invocation and reasoning tokens, and then adopt the AUROC metric to quantify the relationship between reliability and reasoning accuracy. Traces with correct answers tend to exhibit higher reliability at both stages and a reliability leap from the mining stage to the reasoning stage.
  • Figure 2: Overview of our method RTWI. For clarity, we take single-turn cue mining as a showcase and denote stage reliability $w(t^s)$ as $w^s$ with $s\in \{m,r\}$. Given a multimodal question, RTWI first generates multiple interleaved traces and estimates the reliability of visual cues in the mining stage and of textual CoT in the reasoning stage. After that, RTWI identifies and filters out unreliable traces with self-adaptive thresholds. Finally, RTWI assigns higher weights to trustworthy traces based on two carefully designed principles and then aggregates them to derive the answer.
  • Figure 3: Failure case analyses. "Noisy Mining" and "Noisy Reasoning" indicate the underlying causes of incorrect answers.
  • Figure 4: Test-time Scaling. The first row indicates the accuracy as the budget increases. The second row illustrates the relationship between accuracy and generated tokens under various test-time costs.
  • Figure 5: Scaling law across model sizes.
  • ...and 14 more figures
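
The AUROC analysis mentioned in the caption of Figure 1(c) can be sketched as follows. This is a generic pairwise AUROC computation, not code from the paper: the function name `auroc` and the toy reliability scores and correctness labels are hypothetical. AUROC here is the probability that a randomly chosen correct trace has a higher reliability score than a randomly chosen incorrect one, with ties counted as one half.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical per-trace reliability scores and answer correctness (1 = correct).
reliability = [0.9, 0.8, 0.4, 0.6, 0.3]
is_correct = [1, 1, 0, 1, 0]
separable = auroc(reliability, is_correct)

# A case where reliability carries no information about correctness.
uninformative = auroc([0.9, 0.2, 0.6, 0.5], [1, 1, 0, 0])
```

A value near 1.0 means the reliability score separates correct from incorrect traces well, while a value near 0.5 means it is no better than chance.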

Theorems & Definitions (3)

  • Definition 3.1: Stage Reliability
  • Definition 3.2: Trace Reliability
  • Definition 3.3: Reliability Leap