IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation

Zeteng Lin, Xingxing Li, Wen You, Xiaoyang Li, Zehan Lu, Yujun Cai, Jing Tang

TL;DR

The paper tackles multimodal context drift in interleaved image–text generation by introducing IUT-Plug, a lightweight plug-in grounded on Image Understanding Trees that provide explicit symbolic grounding. It separates perception from reasoning, updating a dynamic symbolic state to guide both language and image synthesis without retraining base models. A dynamic evaluation framework probes style, logic, and entity consistency, validated on 3,000 expert-annotated samples with high human agreement. Experimental results show robust improvements across diverse VLM–T2I configurations, underscoring the value of neuro-symbolic grounding for maintaining cross-modal fidelity in extended interactions.

Abstract

Existing vision language models (VLMs), including GPT-4 and DALL·E, often struggle to preserve logic, object identity, and style in multimodal image-text generation. This limitation significantly hinders the generalization capability of VLMs in complex image-text input-output scenarios. To address this issue, we propose IUT-Plug, a module grounded in an Image Understanding Tree (IUT), which enhances existing interleaved VLMs through explicit structured reasoning, thereby mitigating context drift in logic, entity identity, and style. The proposed framework operates in two stages. (1) A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. (2) A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency. To evaluate our approach, we construct a novel benchmark based on 3,000 real human-generated question-answer pairs over fine-tuned large models, introducing a dynamic evaluation protocol for quantifying context drift in interleaved VLMs. Experimental results demonstrate that IUT-Plug not only improves accuracy on established benchmarks but also effectively alleviates the three critical forms of context drift across diverse multimodal question answering (QA) scenarios.
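To make the two stages more concrete, the following is a minimal Python sketch of what a hierarchical symbolic scene state could look like and how it might be serialized to JSON and combined with answer text into a prompt. It is not the authors' implementation: the class names (`ImageUnderstandingTree`, `ObjectNode`, `Relation`), the field layout, the `update_object` helper, and the prompt wording in `build_prompt` are all assumptions made for illustration. Only the general idea (objects, attributes, and relations serialized as JSON and updated across turns) comes from the paper's description.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Relation:
    # Hypothetical edge: the owning object --predicate--> target object name.
    predicate: str
    target: str


@dataclass
class ObjectNode:
    # Hypothetical object node: a name, free-form attributes, and relations.
    name: str
    attributes: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)


@dataclass
class ImageUnderstandingTree:
    # Hypothetical root: one global style label plus the object nodes.
    style: str
    objects: list = field(default_factory=list)

    def update_object(self, name, **attributes):
        """Merge attributes into an existing object, or add a new object."""
        for obj in self.objects:
            if obj.name == name:
                obj.attributes.update(attributes)
                return
        self.objects.append(ObjectNode(name=name, attributes=dict(attributes)))

    def to_json(self):
        """Serialize the whole tree; asdict recurses into nested dataclasses."""
        return json.dumps(asdict(self), sort_keys=True)


def build_prompt(tree, answer_text):
    """Combine the symbolic state and the answer text (assumed wording)."""
    return f"Scene state: {tree.to_json()}\nNarrative: {answer_text}"


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper's data.
    tree = ImageUnderstandingTree(
        style="fantasy illustration",
        objects=[ObjectNode("knight", {"armor": "silver"},
                            [Relation("rides", "griffin")])],
    )
    tree.update_object("griffin", wings="spread")
    print(build_prompt(tree, "The knight mounted his griffin."))
```

Keeping the state as plain data means the same structure can be handed to any language model or text-to-image model as text, which matches the plug-in framing of the abstract; how the actual module extracts and updates this state is not specified on this page.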

Paper Structure

This paper contains 29 sections, 8 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Inconsistency issues in interleaved VLMs. The images generated on the left show a lack of consistency with the input image, as well as among themselves. In contrast, the image on the right is consistent with the input.
  • Figure 2: Overview of our evaluation metric. The evaluation model is fine-tuned on 3,000 expert-annotated samples. For each question-answer pair, we use GPT-4o to generate three dynamic evaluation criteria, and then employ the evaluation model to output "yes" or "no" for each criterion.
  • Figure 3: An image generation pipeline for interleaved tasks. The IUT-Plug generates feature text from the question image or images. This feature text is sent, together with the answer text, to an LLM synthesizer to form a prompt. A text-to-image model then produces the answer image or images. The right panel shows the feature extraction process of the IUT-Plug: it hierarchically extracts key features such as objects, attributes, and relations from the input images. These features are serialized into a structured JSON format for the LLM. This representation ensures precise grounding and supports dynamic state updates in multi-turn interactions.
  • Figure 4: Example 1. Q: A knight and his griffin companion prepare to set off at dawn. A: The knight mounted his griffin, which spread its massive wings, ready to take flight towards the rising sun, its posture full of power.
  • Figure 5: Example 2. Q: An astronaut discovers glowing plants on an alien planet. A: The astronaut stood up, and the scanner in front of her projected a translucent holographic screen displaying complex data about the glowing mushroom.
  • ...and 4 more figures
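
The yes/no protocol in the Figure 2 caption can be sketched in a few lines of Python. This is a hedged illustration, not the paper's scoring code: it assumes the three criteria per sample line up with the style, logic, and entity dimensions named in the TL;DR, and it assumes a simple per-dimension pass rate as the aggregate. The function names `parse_verdict` and `pass_rates` and the input layout are hypothetical.

```python
from collections import defaultdict

# Assumed mapping of the three criteria to the three drift dimensions.
DIMENSIONS = ("style", "logic", "entity")


def parse_verdict(raw):
    """Map an evaluator reply to True/False; reject anything but yes/no."""
    text = raw.strip().lower().rstrip(".")
    if text == "yes":
        return True
    if text == "no":
        return False
    raise ValueError(f"expected 'yes' or 'no', got {raw!r}")


def pass_rates(samples):
    """Compute the fraction of 'yes' verdicts per dimension.

    samples: iterable of dicts mapping each dimension to a raw reply string.
    Returns an empty dict when there are no samples.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for verdicts in samples:
        for dim in DIMENSIONS:
            total[dim] += 1
            passed[dim] += parse_verdict(verdicts[dim])
    return {dim: passed[dim] / total[dim] for dim in DIMENSIONS if total[dim]}
```

Rejecting replies other than "yes" or "no" keeps a malformed evaluator output from being silently counted as a failure; whether the paper handles such cases this way is not stated here.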