Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales

Minghe Gao, Shuang Chen, Liang Pang, Yuan Yao, Jisheng Dang, Wenqiao Zhang, Juncheng Li, Siliang Tang, Yueting Zhuang, Tat-Seng Chua

TL;DR

Fact tackles interpretability and hallucination in Multimodal LLMs by distilling executable, code-based rationales that are faithful, concise, and transferable. It generates reasoning traces from executable visual programs and refines them through three editing operations—dynamic pruning, symbolic merging, and logical bridging—before verifying their transferability to end-to-end models and distilling them into MLLMs with a multi-task objective. The training objective combines label loss and rationale loss as $L = L_{label} + \lambda L_{rationale}$ with $\lambda = 1$, while image inputs are standardized to $224\times 224$ for aligned reasoning. Empirical results on tasks such as GQA, OK-VQA, and TallyQA demonstrate improved compositional reasoning and reduced hallucinations across model sizes, showing strong transferability of the program-derived CoTs to downstream vision-language models.
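The multi-task objective $L = L_{label} + \lambda L_{rationale}$ can be illustrated with a minimal pure-Python sketch. It assumes both terms are token-averaged cross-entropy losses, which the summary does not state explicitly; the function names and the toy logits are hypothetical and are not taken from the paper's code.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of one prediction: log-sum-exp(logits) - logits[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def sequence_loss(token_logits, token_targets):
    """Average cross-entropy over the tokens of one output sequence (assumed form)."""
    losses = [cross_entropy(l, t) for l, t in zip(token_logits, token_targets)]
    return sum(losses) / len(losses)


def multitask_loss(label_logits, label_targets,
                   rationale_logits, rationale_targets, lam=1.0):
    """L = L_label + lam * L_rationale, with lam = 1 as in the summary above."""
    l_label = sequence_loss(label_logits, label_targets)
    l_rationale = sequence_loss(rationale_logits, rationale_targets)
    return l_label + lam * l_rationale


if __name__ == "__main__":
    # Toy example: a 1-token answer and a 2-token rationale over a 3-word vocabulary.
    answer_logits = [[2.0, 0.5, 0.1]]
    rationale_logits = [[0.2, 1.5, 0.3], [0.1, 0.1, 2.2]]
    print(multitask_loss(answer_logits, [0], rationale_logits, [1, 2]))
```

With $\lambda = 1$ the two terms are weighted equally, so the model is penalized as much for an unfaithful rationale as for a wrong answer; setting `lam=0.0` in the sketch recovers plain label-only training.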

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong understanding capabilities across a wide array of visual tasks. Nevertheless, their black-box reasoning processes remain opaque, leaving them uninterpretable and prone to hallucination. Their ability to carry out intricate compositional reasoning is also limited, which causes learning progress to stagnate. In this work, we introduce Fact, a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs. This paradigm uses verifiable visual programming to generate executable code, which guarantees faithfulness and precision. Subsequently, a series of operations including pruning, merging, and bridging makes the rationales more concise. Furthermore, to guarantee transferability, we keep only those rationales that can be transferred from programming paradigms to end-to-end paradigms. Experiments demonstrate the superiority of our method across models of varying parameter sizes, significantly enhancing their compositional reasoning and generalization ability. Our approach also reduces hallucinations owing to its high correlation between images and text.
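The idea of keeping only verifiable programs can be sketched as an answer-checking filter: each candidate program is executed, and only those whose output matches the expected answer are retained. This is a simplified illustration under stated assumptions, not the paper's implementation. The entry-point name `solve`, the case-insensitive string comparison, and the toy candidates are all hypothetical, and real visual programs would call vision modules rather than return constants.

```python
def run_program(source, entry_point="solve"):
    """Execute a candidate program string and return its entry point's result."""
    namespace = {}
    exec(source, namespace)  # only safe for trusted or sandboxed code
    return namespace[entry_point]()


def normalize(answer):
    """Compare answers as lowercase, whitespace-trimmed strings (an assumption)."""
    return str(answer).strip().lower()


def keep_verified(candidates, expected_answer):
    """Keep the programs that run without error and give the expected answer."""
    kept = []
    for source in candidates:
        try:
            answer = run_program(source)
        except Exception:
            continue  # programs that crash or fail to parse are discarded
        if normalize(answer) == normalize(expected_answer):
            kept.append(source)
    return kept


if __name__ == "__main__":
    candidates = [
        "def solve():\n    return 'Yes'\n",   # correct answer
        "def solve():\n    return 'no'\n",    # wrong answer
        "def solve():\n    return 1 / 0\n",   # raises at run time
    ]
    print(len(keep_verified(candidates, "yes")))
```

Filtering on execution results is what ties each retained rationale to a program that actually produced the right answer, rather than to free-form text that merely sounds plausible.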

Paper Structure (16 sections, 3 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: MLLMs exhibit limited proficiency in combinatorial reasoning and spatial understanding, while Fact can significantly enhance their capabilities in performing visual tasks.
  • Figure 2: The pipeline of Fact: 1) Generate executable code from an image and query using a code generation engine and retain code that correctly reasons against expected answers. 2) Simplify code into natural language by pruning irrelevant AST nodes, merging duplicates in symbolic traces, and filling logical gaps to form coherent CoT. 3) Evaluate and filter CoTs for end-to-end model feasibility. 4) Distill refined, accurate CoTs into MLLMs for enhanced adaptability.
  • Figure 3: We use a Python program to explain our editing operations: I) Parse executed code lines into corresponding AST nodes and prune unused loops and conditions, organizing the output into a symbolic trace. II) Merge iterated outputs and update variables, converting the symbolic trace to natural language using an LLM. III) Train a small model to identify gaps between statements, filling them with an LLM to complete the logic of the CoT rationale for clarity and coherence.
  • Figure 4: An example of (a) the process that generates CoT rationale for distillation and (b) outputs of MiniGPT4 with Fact.
  • Figure 5: Sources of error in GQA task.
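
The pruning and merging steps described in the Figure 2 and Figure 3 captions can be approximated with a small sketch. It records which lines of a toy program actually execute, drops statements that never ran (pruning), and collapses repeated loop iterations into a single entry with a count (merging). This is a line-level simplification built on Python's standard `sys.settrace` and `ast` modules, not the authors' implementation; the toy program, the function names, and the `(node_type, text, count)` trace format are assumptions made for illustration. The LLM-based conversion to natural language and the bridging step are not covered.

```python
import ast
import sys

# Hypothetical toy program standing in for a generated visual program.
PROGRAM = '''def count_red(objects):
    total = 0
    for obj in objects:
        if obj == "red":
            total += 1
        else:
            pass
    if total > 10:
        total = 10
    return total
'''

FILENAME = "<visual_program>"


def run_and_trace(source, func_name, *args):
    """Run func_name from source and record the line numbers that execute."""
    namespace = {}
    exec(compile(source, FILENAME, "exec"), namespace)
    executed = []

    def tracer(frame, event, arg):
        # Only record line events that belong to the traced program.
        if event == "line" and frame.f_code.co_filename == FILENAME:
            executed.append(frame.f_lineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = namespace[func_name](*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, executed


def symbolic_trace(source, executed):
    """Prune never-executed statements and merge repeated ones into one entry."""
    # Map each statement's first line to its AST node type.
    node_types = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            node_types.setdefault(node.lineno, type(node).__name__)
    lines = source.splitlines()
    counts = {}
    for lineno in executed:  # dicts preserve first-seen order
        counts[lineno] = counts.get(lineno, 0) + 1
    return [(node_types.get(n, "Unknown"), lines[n - 1].strip(), c)
            for n, c in counts.items()]


if __name__ == "__main__":
    result, executed = run_and_trace(PROGRAM, "count_red", ["red", "red"])
    for entry in symbolic_trace(PROGRAM, executed):
        print(entry)
```

For an input where every object is red, the `else` branch and the clamp `total = 10` never execute, so they are absent from the trace, and the two loop iterations appear as one `total += 1` entry with a count of 2 instead of two separate entries.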